CAIDA Internet eXchange Points (IXPs) Dataset
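This dataset provides information about Internet eXchange Points (IXPs) and their geographic locations, facilities, prefixes, and member ASes. It is derived by combining information from PeeringDB, Hurricane Electric, Packet Clearing House (PCH), and GeoNames.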
An Internet eXchange Point (IX or IXP) is a
physical infrastructure used by Internet service providers (ISPs) and
content delivery networks (CDNs) to exchange Internet traffic between their
networks (Autonomous Systems - ASes). An IXP can be distributed and
located in numerous data centers (aka facilities), and a single facility can
contain multiple IXPs. Each IXP has a prefix, or collection of prefixes, which
are used by companies/ASes to address machines within the IXP infrastructure.
An AS connected to a given IXP is known as a member of that IXP. Internet
traffic exchange through an IXP makes use of the Border Gateway Protocol (BGP),
which identifies ISPs and CDNs by their Autonomous System Numbers (ASNs).
Sources of Data
In order to make the most complete list of IXPs we combined information
available from the following sources:
- PeeringDB (PDB)
- Hurricane Electric (HE)
- Packet Clearing House (PCH)
To exclude outdated entries, we selected only PDB entries with the status "ok", PCH entries with the status "active", and HE entries with an active URL.
We also used GeoNames data
(readme,download)
to derive relevant geographic information.
Methodology
First, we downloaded the GeoNames data
set and created a local sqlite database of geographic coordinates indexed on the
name, asciiname, and alternative names of cities and villages. If we could not
find a match between the name of the city where a certain IXP is located and
any of the location strings in the database, we assigned negative geo_ids to
those IXPs.
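The lookup described above can be sketched roughly as follows. This is only an illustration, not CAIDA's actual code: the table layout, the lowercase matching, the helper names `build_index` and `resolve_geo_id`, and the way negative ids are numbered are all assumptions. When several cities share a name, this sketch simply returns one of them.

```python
import sqlite3

def build_index(rows):
    """Build an in-memory SQLite index from (geo_id, name, asciiname, alternatenames) rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE names (name TEXT, geo_id INTEGER)")
    db.execute("CREATE INDEX names_idx ON names (name)")
    for geo_id, name, asciiname, alternates in rows:
        # Index every known spelling of the place, lowercased (assumption).
        for spelling in {name, asciiname, *alternates}:
            db.execute("INSERT INTO names VALUES (?, ?)", (spelling.lower(), geo_id))
    return db

def resolve_geo_id(db, city, unmatched):
    """Return the GeoNames id for a city, or a negative placeholder id if none matches."""
    row = db.execute(
        "SELECT geo_id FROM names WHERE name = ? LIMIT 1", (city.lower(),)
    ).fetchone()
    if row is not None:
        return row[0]
    # Assumption: unmatched cities get -1, -2, ... in order of first appearance.
    if city not in unmatched:
        unmatched[city] = -(len(unmatched) + 1)
    return unmatched[city]

db = build_index([(5913490, "Calgary", "Calgary", ["YYC"])])
unmatched = {}
calgary_id = resolve_geo_id(db, "Calgary", unmatched)
unknown_id = resolve_geo_id(db, "Nowhereville", unmatched)  # hypothetical city name
```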
Next, we tried to identify the cases when IXPs listed in the different data
sources are in fact the same. This is a non-trivial task since IXP names,
cities, and addresses could be (and are) spelled differently. We first merged
IXPs found in different sources that have the same set of prefixes. For the
remaining IXPs, we calculated the Levenshtein distance between the IXP names.
IXPs with names more than 4 characters long and for which the
distance was less than 2, not determined by the first or last characters of
each string, were assumed to be identical. For example, the Levenshtein
distance between "Equinix Sào Paulo" and "Equinix Sao Paulo" is 1
(one character is different in those names); therefore, we decide that
both designate the same IXP. The names "BIX" and "CIX" are also off by one
character, but they are only 3 characters long, and thus we treat them as
referring to two different IXPs. Finally, "FICIX2" and "FICIX3" are long
enough and also have only one character difference, but it is the last
character of each string, and we conclude that the "2" and "3" indicate
different IXPs.
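A minimal sketch of this name-matching rule is shown below. It is one possible reading of the description, not the code used to build the dataset: the function names are hypothetical, the comparison is case-sensitive, and "not determined by the first or last characters" is interpreted as requiring both names to share the same first and last character.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def same_ixp_name(a: str, b: str, min_len: int = 4, max_dist: int = 2) -> bool:
    """Apply the three conditions from the text to a pair of IXP names."""
    # Both names must be more than 4 characters long.
    if len(a) <= min_len or len(b) <= min_len:
        return False
    # The Levenshtein distance must be less than 2.
    if levenshtein(a, b) >= max_dist:
        return False
    # Assumption: a difference in the first or last character disqualifies the match.
    return a[0] == b[0] and a[-1] == b[-1]

examples = [
    ("Equinix S\u00e0o Paulo", "Equinix Sao Paulo"),
    ("BIX", "CIX"),
    ("FICIX2", "FICIX3"),
]
results = [same_ixp_name(a, b) for a, b in examples]
```

Under these assumptions the three pairs from the text evaluate to a match, no match (too short), and no match (last character differs), respectively.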
Although many IXPs are distributed across multiple facilities, only the PDB
database provides detailed location information about multiple facilities for
individual IXPs. We use the PDB information directly to create facility
records with all their geographic fields (street address, zipcode, city, state,
country, and region) populated. In contrast, both PCH and HE include only a
single facility location for each IXP in their database, and typically locate
it only with city-level accuracy. Thus, we create a facility placeholder
record from PCH and HE data using the most specific geographic data these
databases provide. To populate geographic fields for an IXP record, we assign a
specific value to a given field only if this value is the same in all facilities
or facility placeholder records for this IXP.
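The agreement rule for IXP-level geographic fields can be illustrated with a short sketch. The function name, the field list, and the choice to omit a field when the facilities disagree or leave it empty are assumptions made for this example; the sample addresses are made up.

```python
GEO_FIELDS = ("address", "zipcode", "city", "state", "country", "region")

def ixp_geo_fields(facilities):
    """Keep a geographic field only if every facility record agrees on its value."""
    result = {}
    for field in GEO_FIELDS:
        values = {fac.get(field) for fac in facilities}
        if len(values) == 1:
            value = values.pop()
            # A field missing from all records carries no information, so skip it.
            if value is not None:
                result[field] = value
    return result

# Two hypothetical facility records for one IXP: same city, different street address.
facilities = [
    {"address": "1 First St", "city": "Calgary", "state": "AB", "country": "CA"},
    {"address": "2 Second Ave", "city": "Calgary", "state": "AB", "country": "CA"},
]
geo = ixp_geo_fields(facilities)
```

Here the IXP record would receive city, state, and country, while the address is left unset because the two facilities differ.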
Format
All files are in JSONL (JSON Lines)
format, with comment lines starting with '#' and all other lines containing a
single object in JSON format. JSONL can be converted to JSON with the
jsonl_to_json.py tool. All files begin
with a commented metadata line showing when the file was produced.
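As an alternative to converting the files, they can be read line by line. The reader below is a minimal sketch based only on the format description above; the function name `read_jsonl` and the sample metadata line are hypothetical.

```python
import json

def read_jsonl(lines):
    """Yield one parsed object per line, skipping blank lines and '#' comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        yield json.loads(line)

# In-memory sample; with a real file, pass the open file object instead,
# e.g. `with open("ixs.jsonl") as f: records = list(read_jsonl(f))`.
sample = [
    "# generated on some date (hypothetical metadata line)",
    '{"ix_id": 3, "fac_id": 1110}',
    "",
]
records = list(read_jsonl(sample))
```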
File ixs.jsonl contains information about individual IXPs. Each IXP
is assigned its own "ix_id". The "pch_id" and "pdb_id" values match the IXP ids
in the original sources, Packet Clearing House (PCH) and PeeringDB (PDB)
respectively. (IXP entries in the Hurricane Electric (HE) database do not
have a similar id field.) Among those sources, PDB is the only one that
provides organizational information. Therefore, our "org_id" values are the
same as "pdb_org_id".
{ "ix_id": 3, "pch_id": 1461, "pdb_id": 639, "pdb_org_id": 8375, "name": "Calgary Internet Exchange", "alternatenames": [ "YYCIX Calgary Internet Exchange", "YYCIX" ], "geo_id": 5913490, "city": "Calgary", "state": "AB", "country": "CA", "region": "North America", "sources": [ "pdb", "pch", "he" ], "url": [ "https:\/\/www.yycix.ca\/", "http:\/\/yycix.ca" ], "prefixes": { "ipv4": [ "206.126.225.0\/24" ], "ipv6": [ "2001:504:2f::\/64" ] } }
File facilities.jsonl contains information about individual facilities.
The "clli" value is CLLI name or a COMMON LANGUAGE Location Identifier
Code, an identifier used within the North American telecommunications
industry. Other fields are self-explanatory.
{ "fac_id": 1110, "pdb_fac_id": 2410, "pdb_org_id": 12757, "name": "City of Calgary - City Hall ", "latitude": 51.04551, "longitude": -114.056326, "address": "800 Macleod Trail S.E.", "zipcode": "T2P 2M5", "state": "AB", "country": "CA", "city": "Calgary", "clli": "calgar", "sources": [ "pdb" ] }
File ix-facilities.jsonl contains the mapping between facilities and IXPs.
(Note that it is a "many-to-many" mapping since the same IXP can be present in
a number of facilities and a given facility can host many IXPs.) The example
below means that the IXP with ix_id value of 3 (Calgary Internet Exchange shown
in the first listing) has presence at the facility with fac_id value of 1110
(shown in the listing above). For IXPs present at multiple facilities, our
data set contains multiple records with the same ix_id and different fac_id's.
{ "ix_id": 3, "fac_id": 1110 }
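Because the mapping is many-to-many, it is often convenient to index it in both directions. The following sketch assumes the records have already been parsed into dictionaries; the function name and the second and third sample records are made up for illustration.

```python
from collections import defaultdict

def index_ix_facilities(pairs):
    """Build lookups in both directions from {"ix_id", "fac_id"} records."""
    facs_by_ix = defaultdict(set)
    ixs_by_fac = defaultdict(set)
    for rec in pairs:
        facs_by_ix[rec["ix_id"]].add(rec["fac_id"])
        ixs_by_fac[rec["fac_id"]].add(rec["ix_id"])
    return facs_by_ix, ixs_by_fac

pairs = [
    {"ix_id": 3, "fac_id": 1110},
    {"ix_id": 3, "fac_id": 2222},   # hypothetical second facility for IXP 3
    {"ix_id": 7, "fac_id": 1110},   # hypothetical second IXP at facility 1110
]
facs_by_ix, ixs_by_fac = index_ix_facilities(pairs)
```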
File ix-asns.jsonl shows IP addresses used at a given IXP by each
member AS.
{ "asn": "23467", "ipv4": [ "206.108.115.28", "206.108.115.27" ], "ipv6": [ "2001:504:38:1:0:a502:3467:2", "2001:504:38:1:0:a502:3467:1" ], "ix_id": 6 }
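One possible consistency check is to compare a member's addresses against the prefixes listed for the same IXP in ixs.jsonl. The sketch below uses Python's standard ipaddress module; the function name is hypothetical, and the prefixes shown for ix_id 6 are assumed for illustration rather than taken from the dataset.

```python
import ipaddress

def addresses_outside_prefixes(member, prefixes):
    """Return the member's addresses that fall outside every prefix of the IXP."""
    outside = []
    for family in ("ipv4", "ipv6"):
        networks = [ipaddress.ip_network(p) for p in prefixes.get(family, [])]
        for text in member.get(family, []):
            addr = ipaddress.ip_address(text)
            if not any(addr in net for net in networks):
                outside.append(text)
    return outside

member = {
    "asn": "23467",
    "ipv4": ["206.108.115.28", "206.108.115.27"],
    "ipv6": ["2001:504:38:1:0:a502:3467:2", "2001:504:38:1:0:a502:3467:1"],
    "ix_id": 6,
}
# Assumed prefixes for ix_id 6; the real values would come from ixs.jsonl.
prefixes = {"ipv4": ["206.108.115.0/24"], "ipv6": ["2001:504:38:1::/64"]}
outside = addresses_outside_prefixes(member, prefixes)
```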
File organizations.jsonl contains the information about each organization
learned from PDB. These records can be linked to the corresponding facilities
records by matching their respective pdb_org_id values.
{ "org_id": 229, "pdb_org_id": 229, "name": "Init7 (Switzerland) Ltd", "address": "Technoparkstrasse 5", "zipcode": "8406", "city": "Winterthur", "state": "ZH", "country": "CH", "url": "http:\/\/www.init7.net\/" }
File locations.jsonl is similar to the geoname locations, but contains
negative "geo_id"s for those entries where geographic locations of IXPs
were not found in the geonames dataset. A full description of the fields
can be found
here.
{ "geo_id": 5391811, "geoname_id": 5391811, "name": "San Diego", "asciiname": "San Diego", "alternatenames": [ "davis' folly", "didacopolis", "gorad san-dyega", "graytown", "lungsod ng san diego", "new san diego", "san", "san diegas", "san diego", "san diegu", "san dijego", "san diyego", "san diy\u00e9go" ], "latitude": 32.71533, "longitude": -117.15726, "feature_class": "P", "feature_code": "PPLA2", "country_code": "US", "cc2": "", "admin1_code": "CA", "admin2_code": "073", "admin3_code": "", "admin4_code": "", "population": 1394928, "elevation": 20, "dem": 31 }
Data Access
Please read the terms of the CAIDA Acceptable Use Agreement (AUA) for Publicly Accessible Datasets below:
As required by the AUA, if you use this dataset in any publication (including but not limited to: papers, presentations, web pages, and papers published by a third party) please include the following reference:
The CAIDA UCSD IXPs Dataset, <date range used>
https://www.caida.org/catalog/datasets/ixps/
Please report all your publications (papers, presentations, class projects, websites, etc.) to CAIDA.
Related Objects
See https://catalog.caida.org/dataset/ixps/ to explore objects related to this document in the CAIDA Resource Catalog.