What is a valid email address?
With the ongoing abuse of email-based systems, we need ways to validate the email addresses we’re handling.
We all know what an email address looks like; we see them and use them every single day. But how do you know if it’s valid or not? The next obvious question is: what defines a valid email address?
This is what I intend on investigating.
Before you begin, I would like to make you aware of the difference between validation and verification, which is as follows:
Validation is a check to ensure it is true to the specification (eg: is the number N digits long?). Not to be confused with verification which is a check to ensure it is correct within the intended system (eg: does the number work when phoned?).
A good starting point for investigating almost anything is Wikipedia. So, to make this easy to follow, that’s where we’re going to start, by looking at the “E-mail address” article.
As you read the article, you’ll soon find out about the limitations and validation (not to be confused with authentication) set by the RFCs. The earliest RFC with regards to email was [RFC822], which was made obsolete by [RFC2822]. There are other RFCs you should perhaps also pay attention to which are listed in the article, however I intend on going over these later.
To fully understand how to find out what a valid email address is, we need to fully understand what an RFC is and why we need them.
An RFC (request for comments) essentially is a way in which internet developers can set standards and protocols. The RFCs we need to be focusing on are the ones relating to email, as they will tell us exactly what defines an email address as an email address. Thus in order for us to fully understand what defines an email as valid, we MUST read the RFCs.
RFCs, however, aren’t easy. They are written in what appears to be a mystical language that looks like English but isn’t. Okay, so maybe it’s not that bad, but it isn’t exactly a straightforward task to translate it into “Plain English”.
After reading I Knew How To Validate An Email Address Until I Read The RFC and Paul Gregg’s Demonstrating why email regexs are poor, I knew this wasn’t going to be easy.
To utilise the specification written in the RFC, we need to convert it into a usable language. In this case we will be using regular expressions within PHP. This article assumes you understand PHP and regular expressions, or will at least try…
And so I decided to start translating [RFC2822] into PHP based regular expressions.
The RFC often specifies US-ASCII characters by their decimal value (eg: %d109) rather than as literal characters. In most cases I will translate them to hexadecimal escapes using chr(), ord() and dechex() (eg: %d109 -> chr(109) -> m -> ord('m') -> 109 -> dechex(109) -> 6d -> \\x6D).
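To make that conversion chain concrete, here’s a small sketch. The helper name rfc_decimal_to_hex_escape() is my own invention for illustration; it is not part of PHP or the RFC.

```php
<?php
// Hypothetical helper (not from the RFC or PHP itself): turn an RFC decimal
// value such as %d109 into a two-digit hex escape usable in a PCRE pattern.
function rfc_decimal_to_hex_escape(int $decimal): string
{
    // str_pad() keeps values below 16 two digits wide, e.g. 13 -> "0d".
    return '\\x' . str_pad(dechex($decimal), 2, '0', STR_PAD_LEFT);
}

$char   = chr(109);                          // the character "m"
$code   = ord($char);                        // back to 109
$escape = rfc_decimal_to_hex_escape($code);  // the four characters \x6d

echo $char, ' ', $code, ' ', $escape, PHP_EOL;
```

The resulting escape can be dropped straight into a pattern, which is how the hexadecimal fragments in the rest of this article were produced.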
Note: The PHP code here is for display purposes only, it may not actually work due to the changes wordpress makes to the formatting (in particular to the double quotes), if you require the proper code, it is available on request.
FROM: General Description [RFC2822 Section 2.1]
Messages are divided into lines of characters. A line is a series of
characters that is delimited with the two characters carriage-return
and line-feed; that is, the carriage return (CR) character (ASCII
value 13) followed immediately by the line feed (LF) character (ASCII
value 10). (The carriage-return/line-feed pair is usually written in
this document as “CRLF”.)
$CR = "\\x0d";
$LF = "\\x0a";
$CRLF = "(?:$CR$LF)";
FROM: Primitive Tokens [RFC2822 Section 3.2.1]
The following are primitive tokens referred to elsewhere in this
standard, but not otherwise defined in [RFC2234]. Some of them will
not appear anywhere else in the syntax, but they are convenient to
refer to in other parts of this document.
NO-WS-CTL = %d1-8 / ; US-ASCII control characters
            %d11 / ; that do not include the
            %d12 / ; carriage return, line feed,
            %d14-31 / ; and white space characters
            %d127
text = %d1-9 / ; Characters excluding CR and LF
       %d11 /
       %d12 /
       %d14-127 /
       obs-text
specials = "(" / ")" / ; Special characters used in
           "<" / ">" / ; other parts of the syntax
           "[" / "]" /
           ":" / ";" /
           "@" / "\" /
           "," / "." /
           DQUOTE
No special semantics are attached to these tokens. They are simply
single characters.
$NO_WS_CTL = "[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x7f]";
$text = "[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f]";
$DQUOTE = "\\x22";
$specials = "[\\x28\\x29\\x3c\\x3e\\x5b\\x5d\\x3a\\x3b\\x40\\x5c\\x2c\\x2e$DQUOTE]";
FROM: Miscellaneous obsolete tokens [RFC2822 Section 4.1]
obs-qp = "\" (%d0-127)
obs-text = *LF *CR *(obs-char *LF *CR)
obs-char = %d0-9 / %d11 / ; %d0-127 except CR and
           %d12 / %d14-127 ; LF
$obs_qp = "(?:\\x5c[\\x00-\\x7f])";
$obs_char = "[\\x00-\\x09\\x0b\\x0c\\x0e-\\x7f]";
$obs_text = "(?:$LF*$CR*(?:$obs_char$LF*$CR*)*)";
FROM: Structured Header Field Bodies [RFC2822 Section 2.2.2]
the space (SP, ASCII value 32) and horizontal tab (HTAB, ASCII value 9) characters
(together known as the white space characters, WSP)
$WSP = "[\\x20\\x09]";
FROM: Obsolete folding white space [RFC2822 Section 4.2]
obs-FWS = 1*WSP *(CRLF 1*WSP)
$obs_FWS = "(?:$WSP+(?:$CRLF$WSP+)*)";
FROM: Quoted characters [RFC2822 Section 3.2.2]
quoted-pair = ("\" text) / obs-qp
$quoted_pair = "(?:\\x5c$text|$obs_qp)";
FROM: Folding white space and comments [RFC2822 Section 3.2.3]
FWS = ([*WSP CRLF] 1*WSP) / ; Folding white space
      obs-FWS
ctext = NO-WS-CTL / ; Non white space controls
        %d33-39 / ; The rest of the US-ASCII
        %d42-91 / ; characters not including "(",
        %d93-126 ; ")", or "\"
ccontent = ctext / quoted-pair / comment
comment = "(" *([FWS] ccontent) [FWS] ")"
CFWS = *([FWS] comment) (([FWS] comment) / FWS)
/* 1*WSP requires at least one white space character, hence $WSP+ at the end. */
$FWS = "(?:(?:(?:$WSP*$CRLF)?$WSP+)|$obs_FWS)";
$ctext = "(?:$NO_WS_CTL|[\\x21-\\x27\\x2A-\\x5b\\x5d-\\x7e])";
$ccontent = "(?:$ctext|$quoted_pair)";
/* NOTICE: ‘ccontent’ translated only partially to avoid an infinite loop. */
$comment = "(?:\\x28((?:$FWS?(?:$ccontent|(?1)))*$FWS?\\x29))";
$CFWS = "((?:$FWS?$comment)*(?:(?:$FWS?$comment)|$FWS))";
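Nested comments are where this translation gets awkward, so here’s a minimal sketch of how PCRE recursion can cope with them. It is a simplification rather than a faithful RFC translation: folding white space is reduced to plain spaces and tabs, the obsolete forms are left out, and the variable names are only illustrative.

```php
<?php
// Simplified sketch: FWS is reduced to spaces and tabs, obsolete syntax is
// ignored, and all names here are illustrative assumptions.
$ctext       = '[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x27\x2a-\x5b\x5d-\x7f]';
$quoted_pair = '\\\\[\x01-\x09\x0b\x0c\x0e-\x7f]';

// Group 1 spans both parentheses, so (?1) recurses into a complete comment.
$comment = "(\\((?:[ \\t]*(?:$ctext|$quoted_pair|(?1)))*[ \\t]*\\))";
$pattern = '/^' . $comment . '$/D';

$samples = ['(simple)', '(outer (inner) text)', '(outer (inner text)', 'no comment'];
foreach ($samples as $sample) {
    printf("%-22s %s\n", $sample, preg_match($pattern, $sample) === 1 ? 'comment' : 'not a comment');
}
```

The point of the design is that the capture group being recursed into includes the opening parenthesis, so every nested level has to be balanced.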
FROM: Atom [RFC2822 Section 3.2.4]
atext = ALPHA / DIGIT / ; Any character except controls,
        "!" / "#" / ; SP, and specials.
        "$" / "%" / ; Used for atoms
        "&" / "'" /
        "*" / "+" /
        "-" / "/" /
        "=" / "?" /
        "^" / "_" /
        "`" / "{" /
        "|" / "}" /
        "~"
atom = [CFWS] 1*atext [CFWS]
dot-atom = [CFWS] dot-atom-text [CFWS]
dot-atom-text = 1*atext *("." 1*atext)
$ALPHA = '[\\x41-\\x5a\\x61-\\x7a]';
$DIGIT = '[\\x30-\\x39]';
$atext = "(?:$ALPHA|$DIGIT|[\\x21\\x23-\\x27\\x2a\\x2b\\x2d\\x2f\\x3d\\x3f\\x5e\\x5f\\x60\\x7b-\\x7e])";
$atom = "(?:$CFWS?$atext+$CFWS?)";
$dot_atom_text = "(?:$atext+(?:\\x2e$atext+)*)";
$dot_atom = "(?:$CFWS?$dot_atom_text$CFWS?)";
FROM: Quoted strings [RFC2822 Section 3.2.5]
qtext = NO-WS-CTL / ; Non white space controls
        %d33 / ; The rest of the US-ASCII
        %d35-91 / ; characters not including "\"
        %d93-126 ; or the quote character
qcontent = qtext / quoted-pair
quoted-string = [CFWS]
                DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                [CFWS]
$qtext = "(?:$NO_WS_CTL|[\\x21\\x23-\\x5b\\x5d-\\x7e])";
$qcontent = "(?:$qtext|$quoted_pair)";
$quoted_string = "(?:$CFWS?\\x22(?:$FWS?$qcontent)*$FWS?\\x22$CFWS?)";
FROM: Miscellaneous tokens [RFC2822 Section 3.2.6]
word = atom / quoted-string
$word = "(?:$atom|$quoted_string)";
FROM: Obsolete Addressing [RFC2822 Section 4.4]
obs-local-part = word *("." word)
obs-domain = atom *("." atom)
$obs_local_part = "(?:$word(?:\\x2e$word)*)";
$obs_domain = "(?:$atom(?:\\x2e$atom)*)";
FROM: Addr-spec specification [RFC2822 Section 3.4.1]
addr-spec = local-part "@" domain
local-part = dot-atom / quoted-string / obs-local-part
domain = dot-atom / domain-literal / obs-domain
domain-literal = [CFWS] "[" *([FWS] dcontent) [FWS] "]" [CFWS]
dcontent = dtext / quoted-pair
dtext = NO-WS-CTL / ; Non white space controls
        %d33-90 / ; The rest of the US-ASCII
        %d94-126 ; characters not including "[",
                 ; "]", or "\"
$dtext = "(?:$NO_WS_CTL|[\\x21-\\x5a\\x5e-\\x7e])";
$dcontent = "(?:$dtext|$quoted_pair)";
$domain_literal = "(?:$CFWS?\\x5b(?:$FWS?$dcontent)*$FWS?\\x5d$CFWS?)";
$local_part = "(?:$dot_atom|$quoted_string|$obs_local_part)";
$domain = "(?:$dot_atom|$domain_literal|$obs_domain)";
$addr_spec = "($local_part\\x40$domain)";
There we have it, how to validate an email address according to [RFC2822].
However, let’s stop right there and reflect on what we have here. What we have is a regular expression based on [RFC2822] that must be correct, but does it work? Are there any problems? Well yes, there are some problems…
- The comments, and content of comments have an infinite loop due to possible nested comments.
- It does not appear to validate folding white space where it should.
- It does not correctly validate domain literals (IP addresses). They are simply not validated by [RFC2822], which means that IP addresses that (under current protocol) are invalid (eg: 300.300.300.300) are accepted.
- Domain names are not validated correctly either, IP addresses are allowed, when they shouldn’t be, and certain characters are allowed in places they shouldn’t, like dash (-) at the start or end of a domain name (eg: [email protected]).
- Length is no concern, email addresses can be as long as you like, much like the regex.
- There are many more RFCs to investigate and translate before we can fully validate all parts of an email address.
- The email address validation regular expression according to [RFC2822] ALONE is almost 20,000 characters long, that’s BEFORE we look into solving these other issues.
This is simply unacceptable.
Although there are fixes and workarounds, in the form of stripping and further validation based on other RFCs, I began to feel that this wasn’t really suitable for validating real world email addresses.
Ultimately I feel that unless you’re building a mail client or a mail server, sticking so strictly to the RFC (especially [RFC2822]) isn’t always going to give you the best results in real world situations.
Look around, email addresses in the real world aren’t so strict and are far more loosely defined.
- No folding white space (FWS) – I’ve never seen a multi-line email address field for a single address.
- No comments (CFWS) – Comments simply do not belong in an email address, they can go elsewhere.
- No quotes – When was the last time you saw quoted text in an email address?
- No IP addresses, domains only – They are only used in temporary circumstances, not live.
- No new lines – they could result in “email header injection”.
- Reasonable lengths – both parts, and the whole thing needs to be kept to a reasonable maximum length.
- The domain part doesn’t need to be so strict – We can easily verify it later using DNS.
- TLDs need to be future proof – Don’t restrict yourself to a set list. Don’t forget about IDN.
- Most RFCs are outdated, and unreliable – Remember they are technical documents for servers and clients, but not for real world situations.
- Only need to validate real world email addresses – Don’t be concerned with edge case test samples.
Henceforth, the rest of this article will concentrate on this “less strict” or “LOOSE” specification, defined by real world situations rather than technical ones.
Upon going back to the drawing board I discovered [RFC3696], written by the guy who wrote [RFC2821] (SMTP). This will give us the basics of what is required for a valid email address.
[RFC3696 Section 3] entitled “Restrictions on email addresses” states:
Contemporary email addresses consist of a “local part” separated from a “domain part” (a fully-qualified domain name) by an at-sign (“@”).
We’ll look at the “local part” first.
First off, as above, we will be overlooking quoted forms.
“These quoted forms are rarely recommended, and are uncommon in practice”
We’ll ignore anything about using quotes, “real world” email addresses don’t contain quotes.
Without quotes, local-parts may consist of any combination of alphabetic characters, digits, or any of the special characters
! # $ % & ' * + - / = ? ^ _ ` . { | } ~
period (“.”) may also appear, but may not be used to start or end the local part, nor may two or more consecutive periods appear.
“alphabetic characters” are “a-zA-Z”, digits are “0-9”, and the special characters appear as above. The period is left out of the character class because it is handled separately in the next step. In PHP based regex, the combination, or “comb” for short, looks like this:
$comb = '[a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~-]';
You’ll notice that some of the special characters have backslashes (\) next to them. This is to “escape” them, as they would otherwise hold special meaning in the PHP string or the regular expression. Also the dash (-) symbol was moved to the end so that it did not act as “between”.
Putting this information together, including the bit about periods appearing in the middle, but never two together, that appears like this:
$local_part = "($comb(?:\.$comb)?)+";
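Here’s a quick sketch for checking the period rules. It assumes the period is kept out of the character class, so that it can only appear between runs of other characters, and it uses an equivalent “runs joined by single periods” formulation rather than the exact expression above. The function name local_part_ok() is made up for this example.

```php
<?php
// Sketch only; local_part_ok() is a made-up name for illustration.
function local_part_ok(string $local): bool
{
    // Every permitted character except the period.
    $comb = '[a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~-]';
    // Runs of permitted characters, joined by single periods.
    $pattern = "/^$comb+(?:\\.$comb+)*\$/D";
    return preg_match($pattern, $local) === 1;
}

foreach (['john.smith', '.john', 'john.', 'john..smith', "o'reilly+tag"] as $local) {
    printf("%-14s %s\n", $local, local_part_ok($local) ? 'ok' : 'rejected');
}
```

Leading, trailing and doubled periods are rejected because a period is only ever consumed when it is followed by at least one other permitted character.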
That’s the local part done. Now onto the domain part, which we’ll base on [RFC3696 Section 2].
the labels (words or strings separated by periods) that make up a domain name must consist of only the ASCII [ASCII] alphabetic and numeric characters, plus the hyphen. No other symbols or punctuation characters are permitted, nor is blank space. If the hyphen is used, it is not permitted to appear at either the beginning or end of a label. There is an additional rule that essentially requires that top-level domain names not be all-numeric.
Most internet applications that reference other hosts or systems assume they will be supplied with “fully-qualified” domain names, i.e., ones that include all of the labels leading to the root, including the TLD name. Those fully-qualified domain names are then passed to either the domain name resolution protocol itself or to the remote systems. Consequently, purported DNS names to be used in applications and to locate resources generally must contain at least one period (“.”) character.
A DNS label may be no more than 63 octets long.
Although [RFC3696] doesn’t say so explicitly, we understand that periods cannot appear at the start or end of a domain name, because periods are only used to “separate labels”.
When building this I had some issues to overcome…
- DNS labels cannot start or end with a dash (-), however two or more are allowed together in a label.
- TLDs cannot be “all numerics”. TLDs are generally all alphabetical, APART from IDN TLDs, which start with “xn--”, followed by a string of ASCII characters. This does throw a spanner in the works. However, there’s one consistency seen throughout: all valid TLDs start with at least 1 alphabetical character, and this is what we will check for.
- TLDs are generally between 2 and 6 characters. IDN TLDs change all this, as I have seen IDN TLDs as long as 18 characters in length. The RFC, however, says 63.
- A label can be 1 character long.
Finally, we need to ensure that the length is correct. For this we need to read the [RFC3696 errata].
In addition to restrictions on syntax, there is a length limit on email addresses. That limit is a maximum of 64 characters (octets) in the "local part" (before the "@") and a maximum of 255 characters (octets) in the domain part (after the "@") for a total length of 320 characters. However, there is a restriction in RFC 2821 on the length of an address in MAIL and RCPT commands of 256 characters. Since addresses that do not fit in those fields are not normally useful, the upper limit on address lengths should normally be considered to be 256.
When it comes to dealing with lengths in regular expressions, it can often become very confusing, so I wrote this little piece of advice to refer to…
(it){X,Y} means “match it at least X and at most Y times in a row”
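As a tiny illustration of that rule, here is a sketch using a made-up pattern (nothing to do with email yet) where the group “ab” has to appear between 2 and 4 times:

```php
<?php
// {2,4} applies to the group in front of it: "ab" repeated 2, 3 or 4 times.
$pattern = '/^(?:ab){2,4}$/D';

foreach (['ab', 'abab', 'abababab', 'ababababab'] as $subject) {
    printf("%-12s %s\n", $subject, preg_match($pattern, $subject) === 1 ? 'matches' : 'no match');
}
```

One repetition is too few and five is too many, so only the two middle strings match.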
What we need to do in terms of length is as follows:
- The “local-part” total length must be no longer than 64 characters.
- The “domain-part” total length must be no longer than 255 characters.
- Each “dns-label” total length must be no longer than 63 characters.
- The entire “email address” total length must be no longer than 256 characters.
Put this together with the fact that certain elements cannot start or end with certain characters, it makes it difficult to correctly place the end check. Here’s a run down of that:
- The “local-part” cannot start or end with a period (.)
- The “local-part” must not have two periods together
- A “dns-label” cannot start or end with a dash (-)
I found that I was unable to satisfy both the lengths and the character placements in a single regular expression. This forced me to make a decision, I could have one or the other, or neither.
I figured that lengths actually hold very little value in validation. Providing the email looks right specific lengths won’t matter. Besides, we don’t need regular expressions in order to check lengths, it’s a very simple principle. It’s also worth noting that I discovered the local part CAN be over 64 characters, check it out.
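For anyone who does want to enforce the limits, here’s a minimal sketch of a length check that needs no regular expression at all. The function name email_lengths_ok() is hypothetical, and the numbers are simply the ones from the erratum quoted above, so loosen them if your own testing shows servers accepting more.

```php
<?php
// Hypothetical helper; the limits are the ones quoted from the RFC 3696 erratum.
function email_lengths_ok(string $email): bool
{
    // Split on the last "@", since the domain part can never contain one.
    $at = strrpos($email, '@');
    if ($at === false) {
        return false;
    }
    $local  = substr($email, 0, $at);
    $domain = substr($email, $at + 1);

    if (strlen($email) > 256 || strlen($local) > 64 || strlen($domain) > 255) {
        return false;
    }
    // Each DNS label may be no more than 63 octets long.
    foreach (explode('.', $domain) as $label) {
        if (strlen($label) > 63) {
            return false;
        }
    }
    return true;
}

var_dump(email_lengths_ok('user@example.com'));
var_dump(email_lengths_ok(str_repeat('a', 65) . '@example.com'));
```

Keeping this apart from the syntax check means either one can be relaxed without touching the other.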
After playing around with dots and dashes in various places in email addresses on various servers and clients, I soon discovered that it wasn’t as strict as I had first perceived. I found many examples of dots and dashes where they shouldn’t be, mainly at the end of dns-labels (such as “x-.x.com”). Ultimately, at least for the “local-part”, it’s down to the user. For both parts verification should be used instead.
So the local-part now looks like this:
$local_part = "[a-zA-Z0-9!#\$%&\'\*\+\/=\?\^_`\{\|\}~\.-]+";
And FINALLY, the domain part looks like this:
$consists = '[a-zA-Z0-9][a-zA-Z0-9-]*';
$label = "(?:$consists(?:\.$consists)?)";
$tldlabel = "(?:[a-zA-Z][a-zA-Z0-9-]+)";
$domain = "$label+\.$tldlabel";
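To see how loose this domain check is in practice, here’s a small sketch that rebuilds the pattern from the same parts, with the label group repeated as it is in the final expression further down. The function name domain_looks_valid() is made up for the example.

```php
<?php
// Illustrative only: rebuilds the loose domain pattern from its parts.
// The function name domain_looks_valid() is made up for this sketch.
function domain_looks_valid(string $domain): bool
{
    $consists = '[a-zA-Z0-9][a-zA-Z0-9-]*';
    $label    = "(?:$consists(?:\\.$consists)?)";
    $tldlabel = '(?:[a-zA-Z][a-zA-Z0-9-]+)';
    // One or more label groups, a period, then a TLD that starts with a letter.
    $pattern  = "/^$label+\\.$tldlabel\$/D";
    return preg_match($pattern, $domain) === 1;
}

$samples = ['example.com', 'mail.example.co.uk', 'example.xn--p1ai',
            'localhost', 'example.123', '-example.com', 'x-.x.com'];
foreach ($samples as $domain) {
    printf("%-20s %s\n", $domain, domain_looks_valid($domain) ? 'ok' : 'rejected');
}
```

Note that “x-.x.com” passes, which matches the loose behaviour described above. The nested repetition can also backtrack heavily on long inputs that fail to match, so it may be worth capping the input length before running it.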
We now need to bring the two parts back together, separated by an at-sign (@)…
$addr_spec="$local_part@$domain";
Once you’ve added the syntax to match the start and end position, the resulting regular expression, looks something like this:
/^[a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~.-]+@(?:[a-zA-Z0-9][a-zA-Z0-9-]*(?:\.[a-zA-Z0-9][a-zA-Z0-9-]*)?)+\.(?:[a-zA-Z][a-zA-Z0-9-]+)$/i
I’m sure some of you have probably been shouting all the way through this saying that you can shorten the regex, I purposely didn’t do this to make it easier to follow. However you can shorten [a-zA-Z] by using the “case insensitive” modifier allowing you to remove “A-Z”, it also might be worth noting that you can use “\d” instead of “0-9”.
Here’s what I did:
$addr_spec=str_replace('a-zA-Z','a-z',$addr_spec);
$addr_spec=str_replace('0-9','\d',$addr_spec);
You may also wish to take it further and consider replacing “a-z\d” with “\w”, and also removing the extra “_”, since “\w” means word, which includes “a-zA-Z0-9_”.
Here’s how it looks:
/^[\w!#$%&\'*+\/=?^`{|}~.-]+@(?:[a-z\d][a-z\d-]*(?:\.[a-z\d][a-z\d-]*)?)+\.(?:[a-z][a-z\d-]+)$/i
Update: Due to recent vulnerabilities in PHP’s very own email address validation regex (FILTER_VALIDATE_EMAIL) used in the filter_var function, it’s recommended that you use the D modifier (PCRE_DOLLAR_ENDONLY), which stops “$” from matching before a trailing newline. ie:
/^[\w!#$%&\'*+\/=?^`{|}~.-]+@(?:[a-z\d][a-z\d-]*(?:\.[a-z\d][a-z\d-]*)?)+\.(?:[a-z][a-z\d-]+)$/iD
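To show why that modifier matters, here’s a minimal sketch that wraps the expression in a function and lets you switch the D modifier on and off. The name looks_like_email() and its second parameter are inventions for this example; this is not my validate_email function.

```php
<?php
// Hypothetical wrapper for illustration only.
function looks_like_email(string $email, bool $dollarEndOnly = true): bool
{
    $pattern = '/^[\w!#$%&\'*+\/=?^`{|}~.-]+@(?:[a-z\d][a-z\d-]*(?:\.[a-z\d][a-z\d-]*)?)+\.(?:[a-z][a-z\d-]+)$/i';
    if ($dollarEndOnly) {
        // D = PCRE_DOLLAR_ENDONLY: "$" only matches at the very end of the subject.
        $pattern .= 'D';
    }
    return preg_match($pattern, $email) === 1;
}

var_dump(looks_like_email('user@example.com'));          // plain address
var_dump(looks_like_email("user@example.com\n"));        // trailing newline, D on
var_dump(looks_like_email("user@example.com\n", false)); // trailing newline, D off
```

Without D, the address with a trailing newline slips through, because “$” is normally allowed to match just before a final newline.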
Final thoughts
Learning how to correctly validate an email address has been one of the most stressful and time-consuming things I’ve had to do in web development.
RFCs aren’t easy to understand, they are a complete minefield, and it results in something that is incomprehensible and unmaintainable.
There’s a lot that can be said for proper validation, so many people get it wrong, and it can mean the difference between a sale and no sale, but there’s a difference between doing it properly based strictly on technical specification and doing it properly for real world situations.
In order to validate correctly, you must be in touch with the real world, and not get caught up too much in the technical documentation, otherwise you will find yourself far from the original objective.
Thus a lot can be said about the outdated RFCs, and the people who write them. The technical specification is so far out of touch with reality it does not actually work in practice.
Having said all this, of course validation has its limitations and can only do so much. Once you’ve validated the email address to the best of your ability without spending too many resources, verification is the next step.
This article set out, to all intents and purposes, to validate an email address. Although basic levels of verification can be done very easily, I feel that it goes beyond the scope of this article.
For more information with regards to email address verification, I suggest you look into the Simple Mail Transfer Protocol (SMTP); details can be found in [RFC2821]. You may also be interested in the getmxrr() function. Also consider the use of DNS to verify the domain name.
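As a starting point, a basic DNS check might look something like the sketch below. The function names are hypothetical, it needs working DNS resolution when it runs, and it only tells you that the domain could receive mail, not that the mailbox exists.

```php
<?php
// Hypothetical helpers for illustration; domain_accepts_mail() needs DNS at run time.
function email_domain(string $email): ?string
{
    $at = strrpos($email, '@');
    if ($at === false || $at === strlen($email) - 1) {
        return null; // no "@" at all, or nothing after it
    }
    return substr($email, $at + 1);
}

function domain_accepts_mail(string $domain): bool
{
    $hosts = [];
    // MX records are the normal case; when there are none, SMTP falls back
    // to the domain's own address record.
    return getmxrr($domain, $hosts) || checkdnsrr($domain, 'A') || checkdnsrr($domain, 'AAAA');
}

// Example use (performs real DNS lookups, so it is left commented out here):
// $domain = email_domain('user@example.com');
// var_dump($domain !== null && domain_accepts_mail($domain));
```

This is verification rather than validation, so it belongs after the syntax check, not in place of it.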
I hope you’ve enjoyed reading this article, it took me a long time to complete, and was quite stressful, but I feel satisfied that I am now fully qualified to validate email addresses to a satisfactory level. I hope that now, you are too.
I look forward to your comments.
Resources
- Perl’s Mail::RFC822::Address
- Cal’s is_valid_email_address PHP function
- sinful-music.com’s mime_extract_rfc2822_address
- SimonSlick’s Validate Email Address Format
- Jacob Santos’s “Stop Doing Email Validation the Wrong Way” rant.
- Validate email addresses using regular expressions
- ilovejackdaniels.com on email address validation
Update 17/12/09
I have put my regex into a function called validate_email and have created validemail.org to demonstrate the difficulty of email address validation.
Also, I advise people not to use PHP’s filter_var() and FILTER_VALIDATE_EMAIL because, according to the source code, the regex it uses is from an unmaintained PEAR package called HTML_QuickForm, which has been superseded by HTML_QuickForm2, which does not validate email addresses. This means nobody is assigned to maintaining PHP’s own email validation.
Instead, I recommend using my validate_email function which is not only maintained but also adheres to RFC 760 which states: “In general, an implementation should be conservative in its sending behavior, and liberal in its receiving behavior”. Also known as the Robustness principle.
Thank you for your effort. It was a nice read and made practical sense to me.
Wow, I am quite surprised that you can give a very detailed information about valid email address. Actually, I am working to generate a function to validate email address. So, this article of yours is really helpful
Thanx
Very nice code, but it seems to me, that the following examples will still be validated wrong:
[email protected] (‘.’ is not allowed just before ‘@’)
“test \”test\” test”@test.com (‘\”‘ is allowed inside quotation)
[email protected] (‘..’ is not allowed)
Anyone who disagrees – or who can solve this?
This seems to be more right:
(?:[a-z\d!#$%&’*+/=?^_`{|}~-]+(?:\.[a-z\d!#$%&’*+/=?^_`{|}~-]+)*|”(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*”)@(?:(?:[a-z\d](?:[a-z\d-]*[a-z\d])?\.)+[a-z\d](?:[a-z\d-]*[a-z\d])?|\[(?:(?:25[0-5]|2[0-4][\d]|[01]?[\d][\d]?)\.){3}(?:25[0-5]|2[0-4][\d]|[01]?[\d][\d]?|[a-z\d-]*[a-z\d]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Found it at http://www.regular-expressions.info/email.html and replaced 0-9 with \d
Yes, there’s testing a regex, then there’s testing it against real life.
The reality is that ’.’ IS allowed just before ‘@’;
You will find that using quotes in an email address will simply be rejected by the mail server;
As for the double dot in the domain, that’s a separate issue that shouldn’t really be handled by this regex.
I use mailinator.com and my own setup to do test cases. This tells me if the email address is valid or not as the RFC specification is not reliable enough.
Domain names can be checked using validation, you only need a basic check.
Look at the length of the regex I’ve provided, and the regex provided by regular-expressions.info, the length is ridiculous for such a simple task.
Remember, regex alone CANNOT define a valid email address, there is no “ultimate” regex, you need to use other systems for validation.
Hope this helps.
> Look around, email addresses in the real world aren’t so strict and are far more loosely defined.
> * No folding white space (FWS) – I’ve never seen a multi-line email address field for a single address.
> * No comments (CFWS) – Comments simply do not belong in an email address, they can go else where.
> * No quotes – When was the last time you saw quoted text in an email address?
> * No IP addresses, domains only – They are only used in temporary circumstances, not live.
> * No new lines – they could result in “email header injection”.
> * Reasonable lengths – both parts, and the whole thing needs to be kept to a reasonable maximum length.
If you disallow things that are allowed in the RFC that makes your definition more strict, not less. (stricter=fewer things allowed)
I’m of the opinion that less conditions == less strict.
My point was that things like white space and comments don’t need to be handled, not that they are disallowed.
A lack of handling means less conditions and thus less strict.
Whereas disallowing them would imply conditions, restrictions, thus being strict.
I think you misinterpreted the point I was making.
I like your real-world approach to email address validation. I have never seen an email address with quotes or backslashes in it in the real world, and since most of the email providers I know wouldn’t accept those addresses anyway, I think coding for them even if the RFCs technically allow them is unnecessary. They are edge cases that will come up so rarely as to be negligible. Definitely not worth trying to burden your server with a gigantic regexp. Kudos to you for sticking to your guns, and to what is practical!
According to sources PHP’s filter_var() and FILTER_VALIDATE_EMAIL appears to use regex from the HTML_QuickForm PEAR package:
http://svn.php.net/viewvc/php/php-src/trunk/ext/filter/logical_filters.c?view=co&content-type=text%2Fplain
http://cvs.php.net/viewvc.cgi/pear/HTML_QuickForm/QuickForm/Rule/Email.php?view=co&content-type=text%2Fplain
I still think mine provides more accurate validation for REAL email addresses.
Function provided here: http://hm2k.googlecode.com/svn/trunk/code/php/functions/validate_email.php
I have since learned that the HTML_QuickForm package has been superseded, but is still maintained for bugs and security fixes, and it’s recommended to use HTML_QuickForm2 instead.
However HTML_QuickForm2 does not validate email addresses.
I created this website to demonstrate the difficulty of email address validation: http://validemail.org/
You say the regex according to RFC 2822 was over 20,000 characters long? It hasn’t been optimized very well, then. I’ve written a regular expression which checks the syntax according to RFC 5322; the one which obsoletes RFC 2822. It allows dot-atom, quoted-string, and obsolete local-parts, domain literals (IPv4, IPv6, and IPv4-mapped IPv6), and (internationalized) domain names. It only lacks comments and folding white spaces; the latter of which I plan to include tomorrow. Plus, it takes into account the length of each part (and the entire address), even recognizing quoted pairs as one character and the quoted string double quotes as semantically invisible.
It’s 1,056 characters long. Which is quite a lot, but nothing compared to the 20,000 of which you spoke. But to make it easier, I’ve made a class to allow the user to manipulate the regex at will; allowing him or her to turn off some parts (quoted string, or domain literal, for example). That way, it is useful both for those who want just a simple validator AND for those who want to allow every valid email address (except those containing comments, for the moment).
It’s found at http://squiloople.com/2009/12/20/email-address-validation/ — I hope it’s helpful.
– Michael
“I believe erratum ID 1003 is slightly wrong. RFC 2821 places a 256 character limit on the forward-path. But a path is defined as
Path = "<" [ A-d-l ":" ] Mailbox ">"
So the forward-path will contain at least a pair of angle brackets in addition to the Mailbox. This limits the Mailbox (i.e. the email address) to 254 characters.”
http://www.rfc-editor.org/errata_search.php?rfc=3696
I created a page that compares various regex’s to try and find the best. This one has passed all of my tests:
/^([\w\!\#$\%\&\’\*\+\-\/\=\?\^\`{\|\}\~]+\.)*[\w\!\#$\%\&\’\*\+\-\/\=\?\^\`{\|\}\~]+@((((([a-z0-9]{1}[a-z0-9\-]{0,62}[a-z0-9]{1})|[a-z])\.)+[a-z]{2,6})|(\d{1,3}\.){3}\d{1,3}(\:\d{1,5})?)$/i
More details are at http://fightingforalostcause.net/misc/2006/compare-email-regex.php
If you made a “proof of concept” using the is_email() function from Dominic Sayers, why isn’t this page linked at the article?
@Ast Derek
Not sure what you mean exactly. I have no control over where Dominic Sayers does or does not link to.
I’ve made several revisions to is_email() since the version you have here. The current version is 2.4 and you can download it here: http://www.dominicsayers.com/isemail/
I think Ast Derek is saying that you haven’t included is_email() in the list of resources at the foot of your article. It’s up to you whether you do that, of course.
Two other suggestions:
1. On the http://validemail.org page you might choose to give download locations for the validators you are comparing. This would allow people to get at the latest version of each validator rather than the version you have hosted on your site.
2. RFC 5322 has superseded RFC 2822. It’s very similar so you shouldn’t have much trouble updating this page if you choose to do so.
Good work!
“Remember, regex alone CANNOT define a valid email address, there is no “ultimate” regex, you need to use other systems for validation.”
That’s just not true. You CAN (and I’d even say there IS).
@Dominic Sayers
1. I purposely chose to link to what I’m using, instead of websites subject to change. However I will update the copy of your script.
2. I’m aware of RFC 5322. No further updates will be added to the post itself, however I may re-address the topic in a new post at a later date.
Thanks for the feedback.
@Michael
That statement was written about the same time as RFC 5322 was published so I dare say that things have changed in that time. All that aside, you should provide evidence to support such radical statements otherwise you’re helping nobody.
/^(?!(?>(?1)\x22?\x5C?[\x00-\x7F]\x22?){255,})(?!(?>(?1)\x5C?[\x00-\x7F]){65,}(?1)@)((?>(?>(?>(?>(?>(?>\x0D\x0A)?[\x20\x09])+)?(\x28(?>(?>(?>(?>\x0D\x0A)?[\x20\x09])+)?(?>[\x01-\x08\x0B\x0C\x0E-\x1F\x21-\x27\x2A-\x5B\x5D-\x7F]|\x5C[\x00-\x7F]|(?2)))*(?>(?>(?>\x0D\x0A)?[\x20\x09])+)?\x29))+(?>(?>(?>\x0D\x0A)?[\x20\x09])+)?)|(?>(?>(?>\x0D\x0A)?[\x20\x09])+))?)(?>[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+|(?>\x22(?>(?>(?>(?>\x0D\x0A)?[\x20\x09])+)?(?>[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|\x5C[\x00-\x7F]))*(?>(?>(?>\x0D\x0A)?[\x20\x09])+)?\x22))(?>(?1)\.(?1)(?>[\x21\x23-\x27\x2A\x2B\x2D\x2F-\x39\x3D\x3F\x5E-\x7E]+|(?>\x22(?>(?>(?>(?>\x0D\x0A)?[\x20\x09])+)?(?>[\x01-\x08\x0B\x0C\x0E-\x1F\x21\x23-\x5B\x5D-\x7F]|\x5C[\x00-\x7F]))*(?>(?>(?>\x0D\x0A)?[\x20\x09])+)?\x22)))*(?1)@(?>(?>(?1)\[(?:(?>IPv6:(?>(?>[a-f0-9]{1,4}(?>:[a-f0-9]{1,4}){7})|(?>(?!(?:.*[a-f0-9][:\]]){8,})(?>[a-f0-9]{1,4}(?>:[a-f0-9]{1,4}){0,6})?::(?>[a-f0-9]{1,4}(?>:[a-f0-9]{1,4}){0,6})?)))|(?>(?>IPv6:(?>(?>[a-f0-9]{1,4}(?>:[a-f0-9]{1,4}){5}:)|(?>(?!(?:.*[a-f0-9]:){6,})(?>[a-f0-9]{1,4}(?>:[a-f0-9]{1,4}){0,4})?::(?>[a-f0-9]{1,4}(?>:[a-f0-9]{1,4}){0,4}:)?)))?(?>25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])(?>\.(?>25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])){3}))\](?1))|(?>(?!.*(?1)[a-z0-9-]{64,})(?1)(?>(?>xn--)?[a-z0-9]+(?>-[a-z0-9]+)*(?1)\.(?1)){0,126}(?>xn--)?[a-z0-9]+(?>-[a-z0-9]+)*(?1)))$/isD
Allows a dot-atom (full character range) local part, quoted-string local part, obsolete local part (mixture of (dot) atoms and quoted strings), domain names, internationalized labels, domain literals (IPv4 and IPv6), folding white spaces and nested comments. 1,394 characters. Very long, but much shorter than other expressions which seek to completely verify an email address (Perl’s infamous example). Although, the fact that it’s long is irrelevant: my intention was to simply show that it can be done.
I hope this is helpful.
(Note: I’m hoping HTML tags work in comments — if not, please remove “pre” tags. Thanks).
Couldn’t you have used a pastebin or better, maintain your code in a public repository?
If you do, I’ll consider adding it into production testing.
I linked to it a few comments back. I didn’t want to spam your site with a second posting.
(It also explains the development of the regular expression and provides a class to make it easy to manipulate and use).
The date is a link to the comment…
eg: http://www.hm2k.com/posts/what-is-a-valid-email-address#comment-196432
I’ll check it out.
http://www.validemail.org has been updated.
@Michael Your script is poorly maintained, consider using revision control.
I’ve taken your advice. There’s now revision control and documentation/comments.
Yes, it is helpful to have email, e.g. to apply online or to chat via the internet. So I would appreciate it if you allowed me to have an email
Thanks to tags for helping me about my email
I don’t understand any of this. Please what field of engineering is this? The only reason I surfed this page is because I keep on getting error messages stating that my e-mail [email protected], which I’ve used for donkey years, is invalid. Please help me out oh great scientists.