The Internet is a scary big place, comprising tens of billions of devices hooked together through a series of smaller networks. While most of the global network is known and mapped, there is a large and growing portion of it that is uncharted territory.
That non-trivial section of the Internet is made up of IPv6 addresses and the unknowable number of devices behind them. The major network operators began running out of unallocated IPv4 addresses about three years ago, an event that networking experts had been anticipating for about 30 years. There are about 4.3 billion IPv4 addresses and all of them have been allocated, meaning that each one is owned by one network operator or another. Many of those addresses aren’t actually in use, since large companies and network operators typically hold large pools of allocated but unused IP addresses, but there aren’t any more of them to go around.
Like real estate, we’re not making any more IPv4 addresses. But instead of trying to colonize Mars or build cities under the sea, the Internet’s architects developed a separate address scheme with an unfathomably large pool of addresses. IPv6 has an address space of 2^128, compared to IPv4’s 2^32. As the exhaustion of the IPv4 address space approached, registries started allocating IPv6 addresses, and there are now billions of those addresses active at any given time. But no one really knows how many there are, where they are, what’s behind them, or how they’re organized.
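To put those two exponents in perspective, here is a quick back-of-the-envelope calculation. It is plain arithmetic on the figures above, and the variable names are purely illustrative.

```python
# Size of each address space, from the bit widths of the two protocols.
ipv4_space = 2 ** 32
ipv6_space = 2 ** 128

# How many IPv6 addresses exist for every single IPv4 address (2**96).
ratio = ipv6_space // ipv4_space

print(f"IPv4 addresses: {ipv4_space:,}")
print(f"IPv6 addresses: {ipv6_space:,}")
print(f"IPv6 addresses per IPv4 address: {ratio:,}")
```

The first figure is the "about 4.3 billion" mentioned earlier. The practical upshot is that scanning every IPv6 address in turn, the way IPv4 is routinely scanned, is not feasible.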
A pair of researchers decided to tackle the problem and developed a suite of tools that can find active IPv6 addresses both in the global address space and in smaller, targeted networks. Known as ipv666, the open source tool set can scan for live IPv6 hosts using a statistical model that the researchers built. The researchers, Chris Grayson and Marc Newlin, faced a number of challenges as they went about developing the ipv666 tools, including getting a large IPv6 address list, which they accumulated from several publicly available data sets. They then began the painful process of building the statistical model to predict other IPv6 addresses based on their existing list.
That may seem weird, but IPv6 addresses are nothing at all like their older cousins and come in a bizarre format that doesn’t lend itself to simple analysis or prediction. Grayson and Newlin wanted to find as many live addresses as possible and ultimately try to figure out what the security differences are between devices on IPv4 and those on IPv6.
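For readers who have not looked closely at the format, the short sketch below uses Python's standard `ipaddress` module to show the same address in its compressed and fully expanded forms. The address comes from the 2001:db8::/32 range reserved for documentation and is only an example, not one from the researchers' data.

```python
import ipaddress

# An example address from the documentation-only 2001:db8::/32 range.
addr = ipaddress.IPv6Address("2001:db8::1")

# The short form people usually write, with the run of zeros collapsed to "::".
print(addr.compressed)

# The full form: eight groups of four hex digits.
print(addr.exploded)

# Dropping the colons leaves 32 hex digits (nybbles), four bits each.
nybbles = addr.exploded.replace(":", "")
print(len(nybbles), nybbles)
```

Each of those 32 hex digits is one nybble, which is the unit the researchers' model works with later on.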
“There are devices out there on the network that are going to prefer IPv6 and your normal network firewall rules don’t apply. It seems bad. There are IPv6 ghost networks out there and we started thinking this might be kind of a perfect storm and all we have to do is find the devices,” Grayson said in an interview.
“But we didn’t know how to do that. There’s a bunch of work that’s already been done on this and people have started looking into the fact that there’s a predictable structure to these addresses.”
Grayson and Newlin initially tried a machine learning approach to the problem, without much success.
“We are by no means machine learning experts, which made this endeavour even more laughable, but alas we persisted. With the help of one of our friends we tried building some predictive models with fairly basic algorithms and in all cases the result was an overfit model that would predict the same addresses that we fed it (this is to say that the ML prediction process would take in a list of IP addresses and generate an equally long list of new addresses, but in this case the addresses that it predicted were the same as the input data set),” Grayson wrote in a blog post on the ipv666 work.
With that lesson in their back pockets, Grayson and Newlin moved on to an algorithmic approach. They began by processing the addresses in their data set and estimating how likely each value is to appear at a given position in an IPv6 address, given the value that appears in the position just before it.
“In summary, we have a model that predicts on a per-nybble basis from probability distributions of what nybble values have been seen at that offset, and these probability distributions change depending on what the preceding nybble value is. These addresses are generated left-to-right and always start with 0x02. The model, when implemented in Golang, can generate 10mm addresses and write those addresses to a file in approximately 90 seconds, including some blacklisting and prior existence checks,” Grayson’s post says.
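The idea Grayson describes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the ipv666 implementation, which is written in Go. The function names (`to_nybbles`, `train`, `generate`), the tiny seed list drawn from the documentation range 2001:db8::/32, and the uniform fallback for unseen contexts are all hypothetical. The first nybble is fixed at 2, which is how this sketch reads the "always start with 0x02" remark.

```python
import ipaddress
import random
from collections import Counter, defaultdict

NYBBLES = 32  # a 128-bit address is 32 hex digits


def to_nybbles(address):
    """Expand an IPv6 address to its 32 hex digits, without colons."""
    return ipaddress.IPv6Address(address).exploded.replace(":", "")


def train(addresses):
    """Count, per offset, which nybble follows each preceding nybble."""
    counts = defaultdict(Counter)  # key: (offset, previous nybble)
    for address in addresses:
        digits = to_nybbles(address)
        for offset in range(1, NYBBLES):
            counts[(offset, digits[offset - 1])][digits[offset]] += 1
    return counts


def generate(counts, rng, first="2"):
    """Sample one address left to right from the learned distributions."""
    digits = [first]
    for offset in range(1, NYBBLES):
        seen = counts.get((offset, digits[-1]))
        if not seen:
            # Assumption of this sketch: an unseen context gets a uniform guess.
            digits.append(rng.choice("0123456789abcdef"))
            continue
        values, weights = zip(*seen.items())
        digits.append(rng.choices(values, weights=weights)[0])
    groups = ["".join(digits[i:i + 4]) for i in range(0, NYBBLES, 4)]
    return str(ipaddress.IPv6Address(":".join(groups)))


# Hypothetical seed list; a real run would start from millions of addresses.
seed = ["2001:db8::1", "2001:db8::2", "2001:db8:0:1::10", "2001:db8:0:1::11"]
model = train(seed)
rng = random.Random(0)  # fixed seed so repeated runs give the same candidates
for _ in range(3):
    print(generate(model, rng))
```

Because each nybble depends only on the one before it, the generator can recombine pieces of different seed addresses into candidates that were never in the input, which is the property the earlier machine learning attempt lacked.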
Their work is helped by the fact that addresses and devices on the Internet tend to stick together. So by finding one address, they’d be more likely to find others nearby, whether in physical or logical space.
“The Internet is clumpy and if you have one thing in one place, it’s very likely that you’re going to have more than one thing there. You will tend to find other things nearby,” said Robert Hansen, a security researcher and CTO of Bit Discovery who has spent considerable time looking at the structure and security of IPv6.
“By virtue of the fact that I put one thing close to another in a particular physical space I’m likely to put them close in logical space. You may own billions of IPv6 addresses but they’re all in one clump. Their algorithm helps with that clumpiness.”
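One simple way to see that clumpiness in a list of discovered addresses is to count how many fall into the same prefix. The sketch below groups addresses by /64 using Python's standard `ipaddress` module. The function name `count_by_prefix`, the choice of /64, and the sample addresses (from the documentation range) are assumptions made for illustration, not details of the researchers' tools.

```python
import ipaddress
from collections import Counter


def count_by_prefix(addresses, prefix_len=64):
    """Count how many addresses share each enclosing prefix."""
    clumps = Counter()
    for address in addresses:
        # strict=False masks off the host bits instead of raising an error.
        network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
        clumps[network] += 1
    return clumps


# Hypothetical scan results.
found = ["2001:db8:1::1", "2001:db8:1::2", "2001:db8:1::53", "2001:db8:2:7::1"]
for network, hits in count_by_prefix(found).most_common():
    print(network, hits)
```

A prefix with several hits is a natural place to look for more hosts, which is the intuition behind pointing a scanner at the neighborhood of a known address.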
After running their ipv666 scanner for about eight days, throttled to a throughput of 20 Mbps, Grayson and Newlin discovered more than 84,000 live IPv6 addresses, a large portion of which were not in their original training data set. But there could be many other devices hiding behind any one of those addresses, as Grayson and Newlin discovered when they picked one address range at random and pointed the scanner at it. They found about 5,000 devices in the range, which turned out to belong to an ISP.
Going forward, the researchers want to improve the address-generation algorithm and perhaps get to the point where other researchers can upload IPv6 addresses that they discover with the scanner. Grayson said they’d also like to have a look at how operators are handling the security of their IPv6 networks.
“It could be that they have an IPv6-enabled network and don’t even know about it. If so, what are the firewall rules?” Grayson said.
“There is a recurring pattern, that IPv6 is going to be more easily accessible and not as well protected. We wanted to shed some light on that. People are starting to realize that it’s here and humans are creatures of habit and if we bring the same assumptions we have now into the IPv6 world, there will be misalignments and those misalignments are what attackers will take advantage of.”
Hansen said the security implications of large-scale IPv6 deployments haven’t been looked at very carefully in a lot of cases.
“You have to be very careful about what an IPv6 address even is from a logging perspective. There’s a lot of security software out there [that] isn’t particularly IPv6 compliant,” he said. “The whole concept of blocking IP addresses is ten times more hilariously bad in IPv6.”
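Hansen does not spell out the details, but one common reading of both points can be shown with Python's standard `ipaddress` module. The same IPv6 address can be written several ways, so logs that compare raw strings may treat one host as many. And because a /64 subnet holds 2^64 addresses, a blocklist of individual addresses is easy to sidestep from within the same subnet. The addresses and variable names below are illustrative assumptions.

```python
import ipaddress

# Three spellings of one address, as they might appear in different logs.
spellings = [
    "2001:db8::1",
    "2001:0db8:0000:0000:0000:0000:0000:0001",
    "2001:DB8:0:0:0:0:0:1",
]

# Parsing and re-printing collapses them to a single canonical string.
canonical = {str(ipaddress.IPv6Address(s)) for s in spellings}
print(canonical)

# Matching on a whole prefix instead of a single address.
blocked = ipaddress.ip_network("2001:db8:0:1::/64")
client = ipaddress.ip_address("2001:db8:0:1::abcd")
print(client in blocked)
print(f"{blocked.num_addresses:,} addresses in one /64")
```

Normalizing before logging and matching on prefixes rather than single addresses are two habits that carry over poorly from IPv4 tooling that assumes one string per host.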