There have been a few attempts at solving the internationalization problem. However there doesn't seem to be a solution that works worldwide. This is due in part to the disparity in the standards across the globe. http://en.wikipedia.org/wiki/List_of_country_calling_codes
There are 9 zones and I have only figured out Zone 1 (the North American Numbering Plan Area), which happens to be the zone I live in and was pretty straightforward to implement. I took a shot at solving it here: #696994.
However, as aspope suggested, there may be a need for separate implementations for each zone, or some other user-customizable method. Whatever method is chosen, I believe it should be capable of being truly international, meaning a site can be set up where users from ALL regions can register, rather than being limited to one region at a time.
There has been mention of the Phone CCK module. From my research, it seems to allow validation for only one region at a time, and not all regions are supported.
With regard to validation: in my implementation above I suggested an input mask. This could, however, render the sms_validate_number function null and void. It can nonetheless provide a fixed-input solution; it seems to work pretty well for Zone 1.
What I would like is suggestions from people who understand how the area code/country code system works for the other zones [2 - 9], describing how numbers are qualified and the logical steps to achieve it.

Comments
Excellent discussion opener
We have several methods that we can use to handle numbers, and can use a combination of tools.
An international function currently exists in 6.x-2.x-dev, where the function tries to match the number to a country by taking the first 4, then 3, then 2, then 1 digits of the number and matching them against the list of international calling codes. Even though the code is not very mature, it works very well and I am rewriting it into 6.x-1.x-dev.
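To make the 4,3,2,1 matching concrete, here is a minimal sketch in Python (not the module's actual code). The `CALLING_CODES` table and the `match_country` name are made up for illustration, and the table is deliberately tiny.

```python
# Hypothetical, deliberately tiny table; a real one would cover every calling code.
CALLING_CODES = {
    "1": "NANP (US, Canada and others)",
    "1876": "Jamaica",
    "27": "South Africa",
    "31": "Netherlands",
    "44": "United Kingdom",
    "297": "Aruba",
}


def match_country(number, codes=CALLING_CODES):
    """Return (code, country) for the longest matching prefix, or None."""
    digits = "".join(ch for ch in number if ch.isdigit())
    # Try the longest candidate prefix first so that 1876 wins over 1.
    for length in (4, 3, 2, 1):
        prefix = digits[:length]
        if len(prefix) == length and prefix in codes:
            return prefix, codes[prefix]
    return None
```

Trying the longest prefix first is what lets a NANP entry such as 1876 be told apart from plain 1; a number with no matching prefix simply returns None.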
Anyway, food for thought...
Number length
One of the major validation issues I was having was the length of the telephone numbers. Sometimes you can figure out country codes and such, but the length sometimes varies. Does the current International code handle that well?
Number length
The overall length of the number should be the same (if people are playing by the rules), but you sometimes see variations in length caused by mobile networks doing things like appending extra digits. So there isn't really a hard and fast rule across all of the numbering plans; you'd probably have to build in some sort of exception handling to let site owners configure things to suit their own markets. Also, the length of the country code part of the number can vary, but is not usually more than three digits.
When you get into validation and whether a number is actually a valid mobile number that the framework is allowed to use, it gets messier still. Here's an example from the UK range: mobile numbers are in the 7x range - but 70xxx is for personal numbering services, not mobiles. 76xxx is only a valid mobile number for 07624, but only if you want to allow SMS to the Isle of Man - the rest of the 076 range is supposed to be for pagers. There is also the complication that in some markets - like the UK - a non-mobile number (e.g. a landline) may be a valid SMS recipient and shouldn't be rejected "out of hand" just because it doesn't fit the pattern.
WOW, sigh, see what I mean.
WOW, sigh, see what I mean. It's a bit scary.
Perhaps the SMS International will have to be phased, including more and more regions over time, because I can just imagine that this isn't an isolated case.
Also, a point to note: does validation mean that it should check whether it's a mobile number or not? That can get really messy for some regions and would require intimate knowledge of the cellphone providers' ever-changing ranges of numbers.
There then remains the question of whether we should go for a one-size-fits-all solution or a customizable per-zone/region/country solution.
re: WOW ...
maybe a pluggable template - I don't think we can do a "one size fits all", but we probably can do:
1. First pass validation extracts country info & tidies up (remove spaces or any non-numerics, convert to a common MSISDN format if needed), identifies if country is allowed destination
2. Optional second pass validation does additional country-specific stuff if required (e.g. stripping-out non-US numbers from NANP if the site is US only, etc)
3. Optional third pass validation strips out premium rate numbers, or other unwanteds
This way only the second and third, both optional, steps require a real knowledge of local rules - and they could be added later? The same approach would also enable the later addition of number-sensitive gateway selection, if the core framework supports multiple gateways then 3rd-pass can tag a number with the relevant gateway instructions. That is a far more ambitious setup, which can - and almost certainly should - wait, but the overall structure would allow that kind of extensibility ...... things like least cost routing, subscriber balance tracking / management, could all be added as plugins.
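As a rough illustration of the pass structure, here is a sketch in Python. The function names, the `ALLOWED_COUNTRIES` setting and the two example checks are all assumptions made up for this post, not anything that exists in the framework.

```python
import re

ALLOWED_COUNTRIES = {"44", "1"}  # hypothetical site setting


def first_pass(raw):
    """Tidy the input and check that the country is an allowed destination."""
    digits = re.sub(r"\D", "", raw)
    for code in sorted(ALLOWED_COUNTRIES, key=len, reverse=True):
        if digits.startswith(code):
            return digits
    return None


def uk_mobiles_only(digits):
    """Example optional second pass: for +44, accept only the 7x range."""
    if digits.startswith("44") and not digits.startswith("447"):
        return None
    return digits


def no_personal_numbers(digits):
    """Example optional third pass: drop UK 070 personal numbering."""
    return None if digits.startswith("4470") else digits


def validate(raw, optional_passes=()):
    """Run the mandatory first pass, then any optional passes in order."""
    digits = first_pass(raw)
    for check in optional_passes:
        if digits is None:
            break  # an earlier pass already rejected the number
        digits = check(digits)
    return digits
```

A site that needs no local knowledge would call `validate(number)` on its own; one that does would pass `(uk_mobiles_only, no_personal_numbers)` or its own checks. Each pass returns either the tidied number or None, so later additions such as gateway tagging could in principle slot into the same list.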
Hmm I like that concept-
Hmm, I like that concept - three-level validation!! And being able to enable the levels at the granularity that you need is perfect! Now we need to figure out a customization method to allow for user-friendly validation, perhaps requiring SOME basic knowledge of PHP or regular expressions. Or we could create a high-level validation method.
multi-tier validation
really, the first tier could be quite simple - extract the country code, provide a UI that lets users select allowed countries.
For Level 2 you could, I suppose, do a very simple "regex-light" - for example, for the UK with some really simple rules (! for don't send, & for hand off to another chain, x for one or more trailing digits)
1x ## this is OK, for SMS to landline
2x ## ditto
!3x ## don't want people sending SMS to customer services lines etc
!4x ## no thanks, reserved number range that isn't used, why would you want to text to it?
!5x ## decided we don't want SMS to VoIP numbers at this stage
!6x ## definitely don't want people sending SMS to premium rate lines
!8x ## why would you want to send a text to a free phone, marketing lines or similar
!9x ## definitely definitely don't want people sending SMS to adult service numbers
&7x ## this gets messy, hand it off to another script, use a naming convention to work out what it's called (so eg. a set of rules called 447x would be called if you had &7x in your UK rules)
Level 3 example could then be
75x ## this is ok, valid UK mobile
7624 ## this is ok, valid Isle of Man mobile within the numbering plan
!76x ## this is not ok, they are pager numbers
77x ## this is ok, valid UK mobile
78x ## this is ok, valid UK mobile
!7911x ## this may or may not be ok, WiFi phone service numbers?
79x ## this is ok, valid UK mobile
!7x ## drop all others as they're not in mobile-assigned ranges
People can then adapt rules really quickly for their own country, and we can pool them as templates for people who don't want/need to do anything special or clever. For performance reasons this very simple "language" could then be turned into proper regex or other method, but it achieves a simple way for end users to set their own rules? The pseudo-language would only need to be interpreted whenever it changes, although you could end up with a fairly huge set of regexes I guess - except how many people are genuinely going to be handling international numbers to any significant degree?
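A small interpreter for this kind of rule list might look like the Python sketch below. It is only one possible reading of the syntax: rules are tried top to bottom and the first match decides, every rule is treated as a plain prefix (so the trailing x is dropped rather than enforced as "one or more digits"), anything that falls through is denied, and the `parse_chain` / `evaluate` names and the shortened UK chains are made up for illustration.

```python
def parse_chain(text):
    """Turn rule lines into (action, prefix) pairs; '##' starts a comment."""
    rules = []
    for line in text.splitlines():
        rule = line.split("##", 1)[0].strip()
        if not rule:
            continue  # blank or comment-only line
        action = "allow"
        if rule[0] == "!":
            action, rule = "deny", rule[1:]
        elif rule[0] == "&":
            action, rule = "handoff", rule[1:]
        rules.append((action, rule.rstrip("x")))
    return rules


def evaluate(chains, country_code, national, chain_name=None):
    """First matching rule decides; numbers that fall through are denied."""
    name = chain_name or country_code
    for action, prefix in chains.get(name, []):
        if national.startswith(prefix):
            if action == "handoff":
                # Naming convention from the example: &7x under 44 calls chain "447x".
                return evaluate(chains, country_code, national,
                                country_code + prefix + "x")
            return action == "allow"
    return False


CHAINS = {
    "44": parse_chain("""
        1x   ## OK, SMS to landline
        2x   ## ditto
        !3x  ## customer services lines
        &7x  ## messy, hand off to the 447x chain
    """),
    "447x": parse_chain("""
        75x
        7624    ## Isle of Man mobile
        !76x    ## pagers
        77x
        78x
        !7911x
        79x
        !7x     ## drop everything else in the 7 range
    """),
}
```

With these chains, `evaluate(CHAINS, "44", "7624123456")` is accepted and `evaluate(CHAINS, "44", "7625123456")` is rejected, because the more specific 7624 rule sits above the 76 deny. The sketch does not guard against a chain handing off to itself.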
regex
@frazras, @andybill: great stuff. I really like how the user can customize as little or as much as they want.
To be able to enter these filter rules is a really powerful feature, and I think that it could be even more powerful by allowing for regex. However, I would recommend enforcing only very basic regex features to mitigate any possible ReDoS vulnerabilities.
When you have multiple levels of number validation, how do you imagine that lower levels are called? Would it be how @andybill suggested; letting a matched rule above pass down to a lower chain?
Flat hierarchy suggestion
Humour me on this: it may be simpler (at least from coding perspective) if it were a more flat hierarchy like:
1. Validate country, split country code, clean up number.
2. Validate rest of number. Using @andybill's example for UK (+44):
[default]!
1
2
3!
7!
75
76!
7624
77
78
79
7911!
8!
9!
The parser would find the list of rules that match the number, and would evaluate each one in order.
So, in the +447624000000 case, the parser would evaluate like:
0. +44 matches country UK and is enabled for outbound messaging
1. +44* is denied by default
2. +447* is denied
3. +4476* is denied
4. +44762* unhandled (same as the +4476* rule)
5. +447624* is allowed
6. There are no more rules that match this number. Last result is positive, so number is valid.
Regex should never be required if we are only ever matching the first digits of the number. I also think that we can employ a few UI tricks to make rule editing easier (eg: when I type "!" the line goes red).
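For discussion, the flat evaluation can be sketched in a few lines of Python. This assumes a trailing "!" means deny, a bare "!" line is the default, and matching rules are evaluated from the shortest prefix to the longest so that the last result wins; `parse_flat` and `is_valid_flat` are names invented for the sketch.

```python
FLAT_RULES = """
!
1
2
3!
7!
75
76!
7624
77
78
79
7911!
8!
9!
"""


def parse_flat(text):
    """Map each prefix to True (allow) or False (deny); '' is the default rule."""
    rules = {}
    for line in text.split():
        deny = line.endswith("!")
        rules[line.rstrip("!")] = not deny
    return rules


def is_valid_flat(national, rules):
    """Evaluate every matching prefix in order of length; the last result wins."""
    result = False  # no rule matched at all: deny
    for length in range(len(national) + 1):
        prefix = national[:length]
        if prefix in rules:
            result = rules[prefix]
    return result
```

Because the rules are looked up by prefix, no regex is involved, and the order in which the user typed them does not matter.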
What do you think are the pros/cons of each hierarchy method?
@aspope, @frazras
a late-night comment, will re-read properly tomorrow, but I would always lean towards default-deny: have the explicit passes first, then anything that falls through is denied. We really need to "play safe" with it and not have a default of valid - someone somewhere will find a way to make a dishonest $ on that. It's nicer to have your easier to read list (with comments, maybe, for users to explain any defaults?) .. but a default valid is asking for trouble at some point. It may be paranoid but think of it like writing firewall rules - you need to stop the abuse before the potential for it becomes apparent?
I'd envisaged an explicit hand-off to another rule chain (in my poorly described example an ampersand says "if it matches this, give it to him over there to process", with the matched number being the chain to call). The reason for that is if you have a complex setup but it is not likely to be matched very often, you can take it out of the main rule chain at one stroke. An example that comes to mind is handling the 1- international destinations: the main chain shouldn't have to bother sorting out Bahamas from Bermuda from Boston from Baie Verte if the site owner doesn't want 1- numbers at all, then they have 1!, and if they do, then they have 1& ? You could then (getting ambitious again - told u it was late) do a 'macro' that represents pre-identified codes, so 1&USA followed by 1! would make US numbers ok but not other 1- numbers? Well, logically, that 'macro' should be the "other way up": there are more valid US codes in NANP than non-US codes - although I'm not sure if that holds true for mobile codes.
Another example would be handling +2 codes: if you want to send to countries within Africa, firstly you could just block +29 outright (Greenland, Faroe Is) then hand-off as valid? Except St Helena, on +290 might well be acceptable in some cases, whereas Aruba on +297 is kinda wrong - easier to have a way to say "hand the 29s off to that rule over there, we're so hardly ever going to come across them anyway, and we don't want the overhead of processing them unless we do" ?
One final thought for anyone who is wondering why Aruba is in the +29 range - international phone numbering makes zero sense anymore. It will give you a headache if you try to figure it out.
Do we see this having any
Do we see this having any possible impact on the CCK Phone field - perhaps even a possible integration? If so we could get that community to share their ideas by starting an issue in the project queue. Even though this is limited to mobile, some of these concepts could be ported there as well.
CCK Phone
There is a possible interaction; we're trying to validate more than Phone CCK is though (I've just looked at the UK rules and it would be trivial to setup a nice little business persuading Drupalers to send us each an SMS at £5.00/message premium rate). We could use the phone module to do a first-pass validation of international formats (strip out spaces and other chars, render as E.123), then pass it to a "mobile number" validation stage.
Might be interesting to look at the Asterisk stuff as well ... will do some research over next couple of days and put together something a bit more structured than last night's rambling, that covers Asterisk, CCK phone and the things we've been discussing.
Example
Just for the purposes of discussion/development I have hacked up a simple example of what we have discussed. Try it at http://drupaldev.aspope.com/?q=sms_valid and let me know what you would change or add. The validation function is very simple: it takes one argument (the number) and returns true/false, so it could easily be leveraged by other modules - I guess that CCK could use it for validation, but I don't have any experience with that.
testing using 447624123456 and 447625123456
I think this is in GB with country code 44
Trying rule 7 with prefix 7
Success on rule 7
Trying rule 74! with prefix 74
Trying rule 75 with prefix 75
Trying rule 7624 with prefix 7624
Success on rule 7624
Trying rule 76! with prefix 76
Explicit failure on rule 76!
Validation FAILED for 447624123456
That should've passed validation? Could it be "=true" for first successful match, and then exit the function before the explicit deny?
Tried again using 446725123456
I think this is in GB with country code 44
Trying rule 7 with prefix 7
Success on rule 7
Trying rule 74! with prefix 74
Trying rule 75 with prefix 75
Trying rule 7624 with prefix 7624
Trying rule 77 with prefix 77
Trying rule with prefix
Success on rule
Validation success for 447625123456
Shouldn't that fail due to default deny?
bugs in my hack-job
@andybill, thanks for testing.
In your first example you were caught by the fact that the (dumb) parser is looking at every rule in the order they were written. I will fix this tonight so that more relevant prefixes (eg: prefixes that match more digits) are evaluated more highly.
Your second example is a bit of a mystery - the "7" rule should not match 4467. This must be an error in my regex.
I'll have another working version before tomorrow, but also let me know if you would like anything else to be added.
tried again
removed the 7 rule, so the rules are now:
74!
75
7624
77
Both of the example numbers are passed as valid (the one that begins 7625 shouldn't match, unless the regex is only doing first few digits?)
Ah - might have spotted it. I had a blank line underneath the rules (where it says "Trying rule with prefix" and no more detail, it's because there are some spaces there). Deleted that, and the 7624 is ok, 7625 fail, so correct behaviour.
In terms of searching relevant prefix being more digits, wouldn't it be easier to say on first match, exit function with 'true', to avoid overhead of processing unnecessary rules lower down the list, and in the UI tell people to put more specific rules first? Alternatively when you read the rules in, read them in order of length (which will achieve what you've said about matching more digits, but wouldn't then matter if the user's order is wrong) but still do the exit on first match?
AAA
hi @aspope, @frazras,
Just working through this and thinking about SMS operations in general: it occurred to me that many of the requirements I'm coming across are actually AAA requirements rather than "just" number validation. While validation is complex enough in its own right, there are ways to simplify AAA by breaking it down into components, and I think validation may be one way of doing that? I've noticed the sms_user table has a status field: is there a list anywhere of valid status numbers, or is it a case of going through sms_user.module and pulling them out? It would seem fairly simple to extend the sms_user table with a "valid" field (either numeric with codes, or just boolean), and use the approach we've discussed to discover whether numbers are valid or not when a user signs up, but also leave the door open for another process, such as a subscription timer, to mark them as invalid or suspended; alternatively, if a STOP is received, a number could be marked as invalid but retained to give the user the opportunity to go to the website and re-enable things.
I'll re-write the notes I'd done based on doing this kind of stuff as validation and instead as doing it properly as AAA, and start another thread - let's keep this one for validation only and avoid me causing confusion :)
Ok look again
I have reworked the rule parser on http://drupaldev.aspope.com/?q=sms_valid
It now parses the rule prefixes from largest to smallest, which is more efficient. I have also changed the allow/deny symbols to +/- because it feels more natural to the eye. A nice side-effect of the rewrite is that you can specify a default allow by including a line with only the "+" character.
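Roughly, the longest-first evaluation could be sketched like this in Python. This is not the code running on the demo page; it assumes each rule is written as a sign followed by the prefix (e.g. "+7624", "-76"), and the function names are made up.

```python
def parse_rules(text):
    """Return (prefix, allowed) pairs sorted from longest prefix to shortest."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # a stray blank line must not become a match-everything rule
        if line[0] not in "+-":
            raise ValueError("rule must start with + or -: %r" % line)
        rules.append((line[1:], line[0] == "+"))
    rules.sort(key=lambda rule: len(rule[0]), reverse=True)
    return rules


def first_match_valid(national, rules):
    """The longest matching prefix decides; with no match the number is denied."""
    for prefix, allowed in rules:
        if national.startswith(prefix):
            return allowed
    return False
```

A line containing only "+" parses to an empty prefix, which matches every number but sorts last, so it acts as the default allow described above. Skipping blank lines in the parser also avoids the empty-rule problem found in the earlier test.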
Please take a look and let me know what you think. Cheers guys,
looks good
so far I only managed to "break" it by cheating: I tried an invalid UK number and it got through - but there is no way it could've known it to be invalid. Then an overseas number with a 44 at the front actually was validated (although it broke UK dialling plan rules). Like I said, I cheated.
Tried a US number and it correctly picked out US, and marked it as invalid, which is good; tried both SA and NL numbers and they were marked invalid. Went back and put a default + in the US settings, added a + for one area code and a - for another, worked properly: they were both landlines, but no way for it to know.
A point for UI, would it be possible to make people explicitly confirm a default "+" on save? How are you storing the rules, is it done in a way that could sustain a bulk import on module install with default settings for various jurisdictions, so we can get people to submit rules for their area?
good progress - wish I could help more with coding!
bulk imports, etc
Cheating is important - using a system in a way that was not intended is all part of being a hacker :-) Let me know if you think that any of your cheat numbers should be caught.
The bulk imports are important I think, and a data storage format can be easily specified for that. Do you envisage us being able to provide a suggested base of rules for each country? We can definitely have the user confirm any default-Allow behaviour before Save, and place an indicator on any rule page that has default-Allow.
So @andybill, @frazras, we have been tackling number prefixes for validation, and I cannot think of any other test conditions apart from length. I certainly don't want to hardcode any length or length-per-country values in, because of the possibility of shortcodes and a couple of issues that I have seen from people who ran into this wall. We could make length configurable, but can you think of any other conditions/tests that we might implement? I should note here that there will always be a hook for external validation methods to inject their two cents into the process.
cheat numbers
Not really fair to expect it to pick them up as valid or invalid without me defining the rules in more detail than I had done: the first was a number that most people would say will be valid, the second was simply a 44 placed in front of an overseas mobile number (including the country code: the country code in question invalidated the UK number). I'll try a few more and let you know.
In terms of bulk import & rule sets, I'd envisaged asking people to contribute their rules? Use their local knowledge to build up a library: then a person or team per country / number plan can take on combining (e.g.) all UK contributions into a UK list, all US into a US list, African in Africa, etc. It might be worth considering a hierarchy, based on the nine zones and then subdividing them into countries, so the sparser zones don't need multiple "admins" and the overall picture can be assembled more quickly.
One request: is it possible (without too much effort) to create the ability to add a comment after a rule?
comments
Haha, I just wrote the per-line comment requirement on my list this morning :-)
I like the idea of splitting rulesets into zones and maintainers - this is along the lines of what @frazras had suggested originally.
Failing test
I have tried to validate JAM numbers but it keeps validating against the UK Rules.
http://screencast.com/t/ZDQyNTEyOTAt
it's not actually failing
The validation textfield requires a full international number. It matches the first digits (country code) to a country and then loads the ruleset for that country. It then validates the rest of the number against those rules. So when you enter 4463726 it validates the number against UK (44) ruleset, and when you enter 9897979 it cannot find a ruleset (no matching country code) and fails the number by default.
Sorry, the undiscovered country message should be more clear than "I think this is in with country code 0".
I envisage two more features needed for number validation:
1. Ability to set a default country, so that when a local number is entered this country ruleset will be used. (How do we identify a local number? Will it always begin with "0"? - maybe this should be configurable also.)
2. Ability to disable the country discovery, and just use a big ruleset that can also see the country code.
identifying local numbers for default ruleset selection
How about something simple to start with: Any number that begins with [prefix] (default is "0") will use the [country] ruleset. The validator will attempt to match this default number prefix before trying any country code matching.
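Something like the following Python sketch is what I have in mind. The `pick_ruleset` name, its parameters and the short list of country codes are placeholders, not existing settings.

```python
def pick_ruleset(number, default_prefix="0", default_country="44",
                 country_codes=("1", "27", "31", "44")):
    """Return (country_code, national_number), or None if no ruleset applies."""
    digits = "".join(ch for ch in number if ch.isdigit())
    # The default-prefix check runs before any country code matching.
    if default_prefix and digits.startswith(default_prefix):
        return default_country, digits[len(default_prefix):]
    for code in sorted(country_codes, key=len, reverse=True):
        if digits.startswith(code):
            return code, digits[len(code):]
    return None
```

One caveat with this simple version: a number typed with a 00 international access code also begins with the default prefix "0", so it would be treated as local unless access codes are stripped first.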
I would also consider adding an on/off option to use the default country ruleset for any number that doesn't match a country, but this seems very unwise. We could add the option with a big warning I guess.
default country
You could do "if it begins with 0, what's my default country" and if it doesn't then start scanning for international codes as the leading digits then ask the user to confirm their country ?
Suppose it could depend on whether you want to validate a number when a site user first enters it, or whether it's a per-message / pre-send validation. If you're validating when an end user is typing in their number, can you prompt them? That'd mean the validation being called from profile / content profile; the per-message validation wouldn't matter as the number would be invalid as the country code has not been confirmed. The only real problem I can think of which that leaves is that user_import bypasses those kind of checks .....
Input mask
Well, that's why I had used the input mask: once you selected a country, it would automatically force you to fit a particular format - unless it doesn't work for all cases.
Input mask?
the input you wish to validate needs to have a country code - The number you tested beginning 44 naturally fell in to the UK format - because 44 is UK? I should've watched your screencast before replying, apologies.
The top half of the page aspope has given us is for rulesets & editing, as I understand it, and the validate field below is for testing against the ruleset - so you need the country code to match which ruleset it should test against?
If reply earlier was a bit cryptic, I should've opened with @aspope :) sorry. On that subject, @aspope - if possible, can you do the comment field thing: I'll try to get a sensible UK ruleset done that can be used for testing; how much would be involved in adding a +2 as a zone code, and I'll add another, messier, set for people to play with?
I retried it and it now works
I retried it and it now works. However, from the experience I had with this module, users have entered the following into the database for a site where I had disabled validation because it wasn't working for my location.
the correct format for JM is
1876xxxxxxx
so for an entry for the number 18761234567 I have seen
8761234567
1234567
18761234567
all with their own varied insertions of dashes, parentheses and spaces. So there needs to be a way to show the user what format needs to be used.
users & formats :)
yeah, would be nice if one could rely on consistency.
in a UK context - and sorry I keep coming back to this, but hey :)
a cellular number of 07771231234 can be written as:
(0)777 123 1234
0777 123 1234
447771231234
+44(0)7771231234
07771231234
The CCKP module does a reasonably good job of regex'ing these - @aspope, could u import the regex's into your code, or are the formats too different?
The problem with showing the user what format needs to be used is that the choices, afaik, are E164 please, MSISDN please, or we need something else to tidy them up - that something being internationally consistent; as I said, CCKP does a reasonably good job, but like everything else it is break-able. Then again, could we do an E164 template and "force" people to follow it?
users & formats & I forgot
and some users will be doing bulk imports of users, that bring the mobile number into the profile_values or similar tables ...
UK formats
The only valid display formats are:
+447771231234 - E.164 - should be used for storage.
+44 7771 231234 - International format for display.
07771 231234 - National format for display.
Formats with (0) are especially dangerous and should be avoided.
Users should be allowed to enter numbers in any format they want to, either in international format with country code or in national format but with an ISO 3166 country identifier for clarification.
The data should then be cleaned:
- remove any access code (+, 00, 011, etc), if present,
- remove the country code, if present, and store it for later use,
- generate the country code from the entered ISO 3166 country code, if present,
- use a default country code for national numbers where no country is specified,
- remove and separately store any extension number details,
- remove all spaces and non-digits from whatever is left,
- remove any leading 0 if present (except for Italy and one or two other countries where it should be left in place),
- check the number length is valid for the initial digit(s), and
- identify the number type from the initial digit(s).
Finally, format the number according to the formatting rules for the identified number type and then display it. International format is always preferred.
Separating the validation routines from the formatting routines makes for code that is much easier to maintain.
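The cleaning steps above could be sketched along these lines in Python. This is an illustration under several assumptions, not a tested implementation: `KNOWN_CODES` is a tiny made-up subset of calling codes, the access codes are a per-site parameter (00 by default, since a UK national number can itself begin with 011), a bare number that does not start with 0 is tried against the known country codes, and the ISO 3166 lookup, length check and number-type identification are left out.

```python
import re

KNOWN_CODES = ("1", "39", "44")   # hypothetical subset of calling codes
KEEP_LEADING_ZERO = {"39"}        # Italy keeps its leading 0


def clean_number(raw, default_code="44", access_codes=("00",)):
    """Return (country_code, national_number, extension), or None if unknown."""
    # Remove and separately store any extension such as "x123" or "ext. 123".
    ext = re.search(r"(?:ext\.?|x)\s*(\d+)\s*$", raw, re.IGNORECASE)
    extension = ext.group(1) if ext else None
    if ext:
        raw = raw[:ext.start()]
    # Remove any access code: "+" or one of the configured dialling prefixes.
    international = raw.strip().startswith("+")
    digits = re.sub(r"\D", "", raw)
    if not international:
        for access in access_codes:
            if digits.startswith(access):
                digits, international = digits[len(access):], True
                break
    # Remove the country code, if present, and keep it for later use.
    code = None
    if international or not digits.startswith("0"):
        for candidate in sorted(KNOWN_CODES, key=len, reverse=True):
            if digits.startswith(candidate):
                code, digits = candidate, digits[len(candidate):]
                break
        if code is None and international:
            return None  # explicit international number with an unknown code
    if code is None:
        code = default_code  # national number with no country specified
    # Remove the leading 0, except where the numbering plan keeps it.
    if code not in KEEP_LEADING_ZERO and digits.startswith("0"):
        digits = digits[1:]
    return code, digits, extension
```

With the defaults, all five ways of writing 07771231234 listed earlier in the thread come out as country code 44 and national number 7771231234, including the `+44(0)` form, because the stray 0 is dropped after the country code has been removed.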
I have generated a complete set of validation and formatting RegEx patterns for every UK number range and number type. This can be seen at:
www.aa-asterisk.org.uk/index.php/Regular_Expressions_for_Validating_and_...
The first section of the page shows patterns that can be used to identify whether the number is valid and the number type.
The second section shows the various patterns used for formatting each number range.
The code works to the formats listed at:
www.aa-asterisk.org.uk/index.php/Number_format
noting the complexities detailed in:
www.aa-asterisk.org.uk/index.php/01_numbers
www.aa-asterisk.org.uk/index.php/02_numbers
www.aa-asterisk.org.uk/index.php/Mixed_areas
and other pages.
Country code RegEx patterns
If you need a quick RegEx to decipher how many digits should be in the country code, consider these: