{{page>public:faq}}

====For Existing Govroam Users====

===Q: Why doesn't the test account work?===

A: Firstly, the authentication request has to be made using EAP, not just a PAP or MSCHAP request. Secondly, until Jan 2018 the Root CA on the RADIUS server at the Jisc end was a generic, self-signed one; it is now signed by QuoVadis. Testing used to require turning off certificate checking, but that should no longer be necessary. However, if you still have problems it might be worth disabling verification to see if it helps.

===Q: What's this about realm filtering?===

A: In an ideal world the only unknown requests proxied from one RADIUS server to another would be those for realms which exist elsewhere. In practice though, most of the unknown requests don't contain a valid realm. For instance, a Camford (camford.org.uk) RADIUS server would forward a request for the holby.nhs.uk realm upwards and it would eventually arrive at the Holby RADIUS server, but it would also forward cmford.org.uk, cmaford.org.uk, camford.uk etc. See section 4.2.1 of the Tech Spec for more info.

The number of these invalid requests can exceed the valid ones by factors of thousands to one (mostly because of retries from the client) and can put a significant and unnecessary load on RADIUS servers. So applying rules on the ORPS to spot and remove these as early as possible is a great idea. It's impossible to catch everything but there are a few easy rules that can be implemented:

  - Null realms: where the end-user hasn't put a realm in.
  - Syntactically invalid realms. A valid realm:
    - MUST contain at least one dot ("."), e.g. "camford.org.uk" is OK,
    - MUST NOT start or end with a dot, e.g. ".camford.org.uk" or "camford.org.uk." are both invalid,
    - MUST NOT have two or more sequential dots, e.g. "camford.org..uk" is invalid,
    - MUST consist only of alphanumeric, hyphen and dot characters, so that's a-z, 0-9, "-" and "."; spaces are explicitly not permitted, e.g. "camford.org uk" is invalid,
    - following on from the previous point, MUST NOT start or end with a space, e.g. "camford.org.uk " is invalid.
    - The regex "^[^@]*@[^.]+([a-zA-Z0-9-]+\.)+[a-zA-Z]{2,6}$" should match syntactically valid usernames but it's not well tested.
  - Misspellings of your local realms. If you know that a realm is very likely to be invalid, e.g. cmford.org.uk, then filter it out too.
  - .local can be used locally but should never be forwarded on.
  - Realms which aren't ever likely to be part of the federation:
    * myabc.com
    * 3gppnetwork.org (plus all subrealms thereof)
    * 3gppnetworks.org (plus all subrealms thereof)
    * gmail.com
    * googlemail.com
    * hotmail.com
    * hotmail.co.uk
    * live.com
    * outlook.com
    * yahoo.com
    * yahoo.co.uk
    * yahoo.cn
    * unimail.com

In FreeRADIUS take a look at the 'filter_username' policy, which covers some of the basic cases. You will want to sanity check the tests, though, to make sure that they're suitable for govroam. If nothing else the file will give you a template to use when writing your own filters.

[[siteadmin:configuration_fragments|Code Snippets]]

===Q: What value should I use in the Operator-Name field?===

A: Normally it's the realm which the RADIUS server handles. If there are multiple realms then use the one which best identifies your site. The format is '1holby.nhs.uk'. The '1' is mandatory and defines what follows as a 'realm'.

===Q: What monitoring should I have in place?===

A: Generally that's your choice and depends on what you need from the service and what tools you have. Options:

  * ICMP (ping). The NRPS are pingable so you could use ICMP for status checks and graphing of latency.
  * Test account. Jisc have provided a test account and are happy for sites to use this as an 'authentication ping'. You'll need to set up your monitoring system to be able to send out EAP auth requests, but command line tools like 'eapol_test', which comes with WPA Supplicant, are useful for this.
  * Monitoring systems, such as [[https://www.nagios.org/|Nagios]], have plugins available for RADIUS status checking.
  * Some RADIUS servers, FreeRADIUS for example, have methods for extracting usage data. FreeRADIUS uses the 'Status Server' approach where a special request is sent and information returned (what and to where is configurable). Others use SNMP and have a custom MIB with the information. This approach is useful for monitoring your own servers as you're able to collect information about the number and rate of authentication requests, successes and failures. Tools like [[https://www.observium.net|Observium]] have built-in support for FreeRADIUS and could be used to gather from generic SNMP sources too.
  * Processing of logs is a way of checking for problems, such as common failure messages, errors or server problems. Compiling or graphing statistics on various metrics (failed auths per realm, timeouts per server) can indicate systemic failure in certain areas. Tools like [[https://www.elastic.co/elk-stack|ELK]] can parse, collate and graph such logs.
  * If you're running a Federation then you might want or need to monitor the status and history of the sites you have connected. Many of the above approaches can be used with remote RADIUS servers too. Status Server, if supported, is particularly useful for this.

===Q: Identities seem to have been replaced with 'anonymous@realm' or '@realm' and these aren't authenticating. What's going on?===

A: EAP requests have inner and outer identities.

Inner identities are something like 'fred@site.org' and the username part 'fred' is used to authenticate the user. The Inner identity is contained within an encrypted tunnel and is not visible other than at the client and IdP ends, i.e. the proxies cannot see the Inner credentials.

The Outer identity is visible to all RADIUS servers and is sent unencrypted on the wire. It is used only as a way of routing RADIUS requests around the infrastructure.
The username portion is unused and irrelevant, i.e. from 'fred@site.org' only the 'site.org' portion is used, so that RADIUS proxies know where to send the request on to.

For security reasons it's strongly recommended that clients are configured with anonymous outer identities such as '@site.org' to prevent user names from being logged or sniffed. As such it's incumbent on sites to ensure that their RADIUS servers do not insist on the outer ID matching the inner ID, and do not use the username part of the outer ID for any aspect of authentication or routing.

To be clear, RADIUS proxies should proxy packets with an outer ID of 'username@realm', 'anonymous@realm' or '@realm'. RADIUS IdPs should be able to authenticate based on the inner ID only.

===Q: Can I use a hardware load balancer for the RADIUS servers?===

A: The general advice is that it's not necessary and might not work anyway. There are two issues to consider:

  - There is a shared secret between each pair of RADIUS servers (ORPS/RRPS, RRPS/NRPS), so if you have 3 RRPS and 4 NRPS then that's 12 separate communication paths, each with a shared secret. How does a load balancer deal with that?
  - Whilst basic RADIUS authentications (PAP, MSCHAP etc.) are individual UDP requests which can be load balanced quite trivially, EAP uses a 12 way handshake which is stateful. Each authentication attempt has to happen between the same pair of RADIUS servers, otherwise it will just fail. Thus load balancers have to be EAP-aware and configured accordingly.

Most RADIUS servers have pretty decent load balancing built in anyway, so putting an extra hardware layer in place wouldn't improve resilience much, if at all.

===Q: How do I configure Windows for client certificates?===

For proxying the Outer ID needs to be of the right format (@)

===Q: Why are Zombies bad?===

In RADIUS terms 'zombies' are RADIUS servers flagged as dead by another RADIUS server. RADIUS servers need to monitor the state of their neighbours to ensure that auth requests are handled.
With TCP connections this could be done by simply noticing that the TCP connection fails; with UDP, however, this is not possible. Thus the approach is to wait a set time for a response and mark the neighbour as down (or a 'zombie') if no response is received. After another set period the neighbour is returned to the pool and the process starts again. Generally it only takes one such failed response for a server to be marked as a zombie. The response timeout is normally 30 seconds and the server would then be marked down for five minutes.

This status monitoring is per pair of servers. If one site has four RADIUS servers and their neighbour has three then that means twelve pairings. Thus only twelve failed responses in a five minute period could mean that the neighbour site is completely unavailable. Once a RADIUS server has marked all its neighbours as down and there's nowhere else to send auth requests, it will respond to new auth requests with a Reject and continue to do so until there is a neighbour available.

Without this monitoring process RADIUS servers wouldn't know if they're sending auth requests towards systems that had truly failed, and a proportion of attempts would simply disappear, leaving the user waiting.

This is why it is absolutely critical that all authentication attempts are responded to. **Every single one.**

Where Jisc has a customer which is an individual member, such failures only affect the users of that member. However, where the customer is a Federation the problem is much greater. The Federation's RADIUS servers are likely doing exactly the above status monitoring of their members and marking their neighbouring RADIUS servers as zombies where appropriate. Whilst this all might sound like everything is in hand, it's not.
If a request originates at a member, let's say camford.ac.uk, and is sent via Jisc to the Scarfolk Federation for the holby.nhs.uk site, then it passes through a number of RADIUS servers on the way, each doing the same status checks on their neighbours. So, let's say that the holby.nhs.uk RADIUS server decides not to respond to the requests. After 30s the Scarfolk Fed RADIUS servers will mark the first holby.nhs.uk RADIUS server as a zombie. BUT, and here is the real problem, Jisc's NRPS don't see a response coming back from the Scarfolk Fed's server, so Jisc will mark the first Fed server as a zombie. Even Camford might then mark the first Jisc RADIUS server as down.

A Federation will have multiple members and thus lots of traffic. More members means they're more likely to have a site failing to respond, and thus to appear to Jisc to be failing to respond themselves. Federations are therefore more likely to be marked as zombies. So it's imperative that Federations do everything they can to ensure that their members respond to every request. **Every single one.**

We ask that members don't do this sort of status monitoring on neighbours 'above' them in the hierarchy, i.e. Camford and Scarfolk shouldn't apply it to Jisc, and Holby shouldn't apply it to Scarfolk's RADIUS servers. However, not all RADIUS servers are capable of this selective status checking.

The important message is that every RADIUS server should respond to every request, no matter what. No exceptions. No excuses.
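
To make the zombie arithmetic above a bit more concrete, here is a rough Python sketch of one RADIUS server's view of its neighbours. It is only an illustration of the behaviour described in this answer, not how any particular RADIUS server implements it. The class and function names and the 'nrps1' style server names are made up for the example; the 30 second timeout and five minute down period are the typical figures quoted above and will differ between deployments.

```python
from dataclasses import dataclass, field
from typing import Dict, List

RESPONSE_TIMEOUT = 30  # seconds to wait for a reply (typical figure from the text)
ZOMBIE_PERIOD = 300    # seconds a neighbour stays marked down (five minutes)


@dataclass
class NeighbourPool:
    """Hypothetical model of one RADIUS server's view of its neighbours."""
    neighbours: List[str]
    down_until: Dict[str, float] = field(default_factory=dict)

    def available(self, now: float) -> List[str]:
        # A neighbour returns to the pool once its down period has expired.
        return [n for n in self.neighbours if self.down_until.get(n, 0.0) <= now]

    def mark_unanswered(self, neighbour: str, timed_out_at: float) -> None:
        # A single unanswered request is enough to mark the neighbour down.
        self.down_until[neighbour] = timed_out_at + ZOMBIE_PERIOD

    def route(self, now: float) -> str:
        # With no live neighbour left, new requests are answered with a Reject.
        live = self.available(now)
        return live[0] if live else "Reject"


def pairings(local_servers: int, neighbour_servers: int) -> int:
    """Status is tracked per pair of servers, so the two counts multiply."""
    return local_servers * neighbour_servers


# Three neighbours each fail to answer one request sent at t=0,
# so each times out at t=RESPONSE_TIMEOUT.
pool = NeighbourPool(["nrps1", "nrps2", "nrps3"])
for name in pool.neighbours:
    pool.mark_unanswered(name, timed_out_at=RESPONSE_TIMEOUT)

print(pairings(4, 3))         # 12
print(pool.route(now=60.0))   # Reject
print(pool.route(now=400.0))  # nrps1
```

The point of the sketch is how little it takes: one unanswered request per neighbour is enough for this server to start rejecting everything until the down period expires, and with four servers on one side and three on the other only twelve such failures cut the sites off from each other.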