
PythonBiogeme: a short introduction

Michel Bierlaire

July 6, 2016

Report TRANSP-OR 160706
Transport and Mobility Laboratory
School of Architecture, Civil and Environmental Engineering
Ecole Polytechnique Fédérale de Lausanne
transp-or.epfl.ch

Series on Biogeme
The package Biogeme (biogeme.epfl.ch) is designed to estimate the parameters of various models using maximum likelihood estimation. It is particularly designed for discrete choice models. In this document, we present step by step how to specify a simple model, estimate its parameters and interpret the output of the software package. We assume that the reader is already familiar with discrete choice models, and has successfully installed PythonBiogeme. This document has been written using PythonBiogeme 2.5, but should remain valid for future versions.

1 The data file

Biogeme assumes that the first line of the data file contains a list of labels corresponding to the available data, and that each subsequent line contains exactly the same number of numerical values, each row corresponding to an observation. Delimiters can be tabs or spaces. The tool biopreparedata can be used to transform a file in Comma Separated Values (CSV) format into the required format. The tool biocheckdata verifies whether the data file complies with the required format.
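The format requirements can be illustrated with a few lines of plain Python. This is only a minimal sketch, not the biocheckdata tool: the function name check_data_format and the sample rows are hypothetical, and the check covers only what is stated here (one line of labels, then rows with the same number of numerical values, separated by tabs or spaces).

```python
def check_data_format(lines):
    """Return (labels, n_rows) if the lines follow the expected layout.

    Raises ValueError on the first row that has the wrong number of
    fields or that contains a non-numerical value.
    """
    # split() without argument accepts both tabs and spaces as delimiters
    rows = [line.split() for line in lines if line.strip()]
    if not rows:
        raise ValueError("empty data file")
    labels = rows[0]
    for number, fields in enumerate(rows[1:], start=2):
        if len(fields) != len(labels):
            raise ValueError(
                "row %d has %d values, expected %d"
                % (number, len(fields), len(labels)))
        for field in fields:
            float(field)  # raises ValueError if the value is not numerical
    return labels, len(rows) - 1


# Made-up sample: a header line followed by two observations.
sample = [
    "CHOICE\tTRAIN_TT\tTRAIN_CO",
    "2\t112\t48",
    "1 103 48",
]
labels, n_rows = check_data_format(sample)
print(labels, n_rows)
```

The same function can be applied to a real file by passing it the list returned by readlines().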
The data file used for this example is swissmetro.dat. Note that the first time a data file is used by Biogeme, it is compressed and saved in binary format in a file. The name of this file is the same as the original file, preceded by __bin_. In our example, the binary file is __bin_swissmetro.dat. If the original text file is modified, the binary file must be erased from the directory in order to account for the changes. The name of the file that has actually been used is reported in the output file.
Biogeme is available in two versions. BisonBiogeme is designed to estimate the parameters of a list of predetermined discrete choice models such as logit, binary probit, nested logit, cross-nested logit, multivariate extreme value models, discrete and continuous mixtures of multivariate extreme value models, models with nonlinear utility functions, models designed for panel data, and heteroscedastic models. It is based on a formal and simple language for model specification. PythonBiogeme is designed for general purpose parametric models. The specification of the model and of the likelihood function is based on an extension of the Python programming language. A series of discrete choice models are precoded for ease of use.
In this document, we describe the model specification for PythonBiogeme.

2 The model

The model is a logit model with 3 alternatives: train, Swissmetro and car. The utility functions are defined as:
V_1 = V_TRAIN = ASC_TRAIN + B_TIME * TRAIN_TT_SCALED
        + B_COST * TRAIN_COST_SCALED
V_2 = V_SM = ASC_SM + B_TIME * SM_TT_SCALED
    + B_COST * SM_COST_SCALED
V_3 = V_CAR = ASC_CAR + B_TIME * CAR_TT_SCALED
        + B_COST * CAR_CO_SCALED
where TRAIN_TT_SCALED, TRAIN_COST_SCALED, SM_TT_SCALED, SM_COST_SCALED, CAR_TT_SCALED, CAR_CO_SCALED are variables, and ASC_TRAIN, ASC_SM, ASC_CAR, B_TIME, B_COST are parameters to be estimated. Note that it is not possible to identify all alternative specific constants ASC_TRAIN, ASC_SM, ASC_CAR from data, because only differences between utilities affect the choice probabilities. Consequently, ASC_SM is normalized to 0.
The availability of an alternative $i$ is determined by the variable $y_i$, $i = 1, \ldots, 3$, which is equal to 1 if the alternative is available, 0 otherwise. The probability of choosing an available alternative $i$ is given by the logit model:
$$P(i \mid \{1,2,3\}; x, \beta) = \frac{y_i e^{V_i(x,\beta)}}{y_1 e^{V_1(x,\beta)} + y_2 e^{V_2(x,\beta)} + y_3 e^{V_3(x,\beta)}}.$$
Given a data set of $N$ observations, the log likelihood of the sample is
$$\mathcal{L} = \sum_{n} \log P(i_n \mid \{1,2,3\}; x_n, \beta),$$
where $i_n$ is the alternative actually chosen by individual $n$.
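To make the role of the availability variables concrete, the logit probability and the log likelihood of a sample can be sketched in plain Python. This is an illustration under stated assumptions, not the code used by PythonBiogeme: the function names and the two observations are made up, and the utilities are given directly as numbers rather than computed from data.

```python
import math


def logit_probability(utilities, availability, chosen):
    """Probability of `chosen` among the available alternatives.

    Both arguments are dictionaries keyed by alternative number;
    availability values are 1 (available) or 0 (not available).
    """
    denominator = sum(availability[j] * math.exp(utilities[j])
                      for j in utilities)
    return availability[chosen] * math.exp(utilities[chosen]) / denominator


def log_likelihood(observations):
    """Sum of the log probabilities of the chosen alternatives."""
    return sum(math.log(logit_probability(V, av, choice))
               for V, av, choice in observations)


# Two made-up observations: (utilities, availability, chosen alternative).
observations = [
    ({1: -1.0, 2: 0.0, 3: -0.5}, {1: 1, 2: 1, 3: 1}, 2),
    ({1: -1.0, 2: 0.0, 3: -0.5}, {1: 1, 2: 1, 3: 0}, 1),
]
print(log_likelihood(observations))
```

In the second observation, alternative 3 is unavailable, so its term drops out of the denominator and the probabilities of alternatives 1 and 2 sum to one.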

3 Model specification: PythonBiogeme

The model specification file must have an extension .py. The file 01logit.py is reported in Section A.1. We describe here its content.
The objective is to provide to PythonBiogeme the formula of the log likelihood function to maximize, using a syntax based on the Python programming language, and extended for the specific needs of Biogeme. The file can contain comments, designed to document the specification. Comments are introduced by the character #, consistently with the Python syntax. All characters after this character, up to the end of the current line, are ignored by PythonBiogeme. In our example, the file starts with comments describing the name of the file, its author and the date when it was created. A short description of its content is also provided.
###############################################################################
#
# @file 01logit.py
# @author: Michel Bierlaire, EPFL
# @date: Wed Dec 21 13:23:27 2011
#
# Logit model
# Three alternatives: Train, Car and Swissmetro
# SP data
#
###############################################################################
These comments are completely ignored by PythonBiogeme. However, it is recommended to use many comments to describe the model specification, for future reference, or to help other persons to understand the specification.
The specification file must start by loading the Python libraries needed by PythonBiogeme. Two libraries are mandatory: biogeme and headers. The first includes the extension of the Python programming language needed by PythonBiogeme. The second imports the names of the headers of the data file, so that they can be directly used in the specification of the model. In this example, an additional library is loaded as well: statistics. It implements some functions that report statistics about the data file.
from biogeme import *
from headers import *
from statistics import *
The next statements use the function Beta to define the parameters to be estimated. For each parameter, the following information must be mentioned:
  1. the name of the parameter,
  2. the default value,
  3. a lower bound,
  4. an upper bound,
  5. a flag that indicates if the parameter must be estimated (0) or if it keeps its default value (1),
  6. a description of the parameter, to be used in the report.
Note that, in Python, case sensitivity is enforced, so that varname and Varname would represent two different variables. In our example, the default value of each parameter is 0. If a previous estimation had been performed, we could have used the previous estimates as default values. Note that, for the parameters that are estimated by PythonBiogeme, the default value is used as the starting value for the optimization algorithm. For the parameters that are not estimated, the default value is used throughout the estimation process. In our example, the parameter ASC_SM is not estimated (as specified by the 1 in the fifth argument on the corresponding line), and its value is fixed to 0. A lower bound and an upper bound must be specified. By default, we suggest using -1000 and 1000. If the estimated value of the parameter happens to be equal to one of these bounds, it is a sign that the bounds are too tight and larger values should be provided. However, most of the time, if a coefficient reaches the value 1000 or -1000, it means that its variable is poorly scaled, and that its units should be changed.
ASC_CAR = Beta('ASC_CAR',0,-1000,1000,0,'Car cte.')
ASC_TRAIN = Beta('ASC_TRAIN',0,-1000,1000,0,'Train cte.')
ASC_SM = Beta('ASC_SM',0,-1000,1000,1,'Swissmetro cte.')
B_TIME = Beta('B_TIME',0,-1000,1000,0,'Travel time')
B_COST = Beta('B_COST',0,-1000,1000,0,'Travel cost')
Note that none of the Python variables is used by PythonBiogeme. They are used only to simplify the writing of the formula. Therefore, nothing prevents us from writing
car_cte = Beta('ASC_CAR',0,-1000,1000,0,'Car cte.')
and using car_cte later in the specification. The variable car_cte will be unknown to PythonBiogeme and will not appear in any reporting file. We strongly advise against this practice, and suggest using the exact same name for the Python variable on the left hand side and for the PythonBiogeme variable, which appears as the first argument of the function, as illustrated in this example.
It is possible to define new variables in addition to the variables defined in the data file. It can be done by defining Python variables using the Python syntax:
SM_COST = SM_CO * ( GA == 0 )
TRAIN_COST = TRAIN_CO * ( GA == 0 )
It can also be done by defining PythonBiogeme variables, using the function DefineVariable.
CAR_AV_SP = DefineVariable('CAR_AV_SP',CAR_AV * ( SP != 0 ))
TRAIN_AV_SP = DefineVariable('TRAIN_AV_SP',TRAIN_AV * ( SP != 0 ))
The latter definition is equivalent to adding a column with the specified header to the data file. It means that the value of the new variables for each observation is calculated once before the estimation starts. On the contrary, with the method based on Python variables, the calculation will be applied again and again, each time it is needed by the algorithm. For small models, it may not make any difference, and the first method may be more readable. But for models requiring a significant amount of time to be estimated, the time savings may be substantial.
When boolean expressions are involved, the value TRUE is represented by 1, and the value FALSE is represented by 0. Therefore, a multiplication involving a boolean expression is equivalent to an "AND" operator. The above code is interpreted in the following way:
  • CAR_AV_SP is equal to CAR_AV if SP is different from 0, and is equal to 0 otherwise. TRAIN_AV_SP is defined similarly.
  • SM_COST is equal to SM_CO if GA is equal to 0, that is, if the traveler does not have a yearly pass (called "general abonnement"). If the traveler possesses a yearly pass, then GA is different from 0, and the variable SM_COST is zero. The variable TRAIN_COST is defined in the same way.
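The 0/1 convention can be tried in plain Python, where True and False also behave as 1 and 0 in arithmetic. The following is a small sketch on scalar values only; the function names and the numbers are made up, whereas in a specification file the same expressions are evaluated by PythonBiogeme for each row of the data file.

```python
def sm_cost(sm_co, ga):
    """Swissmetro cost: SM_CO if there is no yearly pass (GA == 0), else 0."""
    return sm_co * (ga == 0)


def car_av_sp(car_av, sp):
    """Car availability, kept only for observations where SP != 0."""
    return car_av * (sp != 0)


print(sm_cost(52, 0))     # traveler without a yearly pass
print(sm_cost(52, 1))     # traveler with a yearly pass
print(car_av_sp(1, 1))    # car available, SP observation
print(car_av_sp(1, 0))    # car available, but not an SP observation
```

The multiplication plays the role of the "AND" operator: the result is nonzero only if both factors are nonzero.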
Variables can also be rescaled. For numerical reasons, it is good practice to scale the data so that the values of the estimated parameters are around 1.0. A previous estimation with the unscaled data has generated parameters around -0.01 for both cost and time. Therefore, time and cost are divided by 100.
TRAIN_TT_SCALED = DefineVariable('TRAIN_TT_SCALED',\
    TRAIN_TT / 100.0)
TRAIN_COST_SCALED = DefineVariable('TRAIN_COST_SCALED',\
    TRAIN_COST / 100)
SM_TT_SCALED = DefineVariable('SM_TT_SCALED', SM_TT / 100.0)
SM_COST_SCALED = DefineVariable('SM_COST_SCALED', SM_COST / 100)
CAR_TT_SCALED = DefineVariable('CAR_TT_SCALED', CAR_TT / 100)
CAR_CO_SCALED = DefineVariable('CAR_CO_SCALED', CAR_CO / 100)
We now write the specification of the utility functions.
V1 = ASC_TRAIN + \
        B_TIME * TRAIN_TT_SCALED + \
        B_COST * TRAIN_COST_SCALED
V2 = ASC_SM + \
        B_TIME * SM_TT_SCALED + \
        B_COST * SM_COST_SCALED
V3 = ASC_CAR + \
        B_TIME * CAR_TT_SCALED + \
        B_COST * CAR_CO_SCALED
We need to associate each utility function with the number of the alternative, using the numbering convention in the data file. In this example, the convention is described in Table 1. To do this, we use a Python dictionary:
V = {1: V1,
    2: V2,
    3: V3}
We also use a dictionary to describe the availability conditions of each alternative:
av = {1: TRAIN_AV_SP,
    2: SM_AV,
    3: CAR_AV_SP}
Train       1
Swissmetro  2
Car         3
Table 1: Numbering of the alternatives
We now define the choice model. The function bioLogLogit provides the logarithm of the choice probability of the logit model. It takes three arguments:
  1. the dictionary describing the utility functions,
  2. the dictionary describing the availability conditions,
  3. the alternative for which the probability must be calculated.
In this example, we obtain
logprob = bioLogLogit(V,av,CHOICE)
We next define an iterator on the data using the statement
rowIterator('obsIter')
and define the ESTIMATE variable of the BIOGEME_OBJECT with the formula of the likelihood function:
BIOGEME_OBJECT.ESTIMATE = Sum(logprob,'obsIter')
Other variables can be defined in the BIOGEME_OBJECT. In particular, the EXCLUDE variable allows some observations in the data file to be ignored. It contains a boolean expression that is evaluated for each observation in the data file. Each observation such that this expression is "true" is discarded from the sample. In our example, the modeler has developed the model only for work trips, so that every observation such that the trip purpose is not 1 or 3 is removed. Observations such that the dependent variable CHOICE is 0 are also removed. Remember the convention that "false" is represented by 0, and "true" by 1, so that the "*" can be interpreted as an "and", and the "+" as an "or". Note also that the result of the "+" can be 2, so that we test whether the result is greater than 0. The exclude condition in our example is therefore interpreted as: either (PURPOSE different from 1 and PURPOSE different from 3), or CHOICE equal to 0.
exclude = (( PURPOSE != 1 ) * ( PURPOSE != 3 ) + \
    ( CHOICE == 0 )) > 0
BIOGEME_OBJECT.EXCLUDE = exclude
Note that we have conveniently used an intermediary Python variable exclude in this example. It is not necessary. The above statement is completely equivalent to
BIOGEME_OBJECT.EXCLUDE = \
    (( PURPOSE != 1 ) * ( PURPOSE != 3 ) + \
    ( CHOICE == 0 )) > 0
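The arithmetic of the exclude condition can be checked on a few rows in plain Python. This is only a sketch of the boolean logic: the function is written for scalar values, and the (PURPOSE, CHOICE) pairs are made up rather than taken from swissmetro.dat.

```python
def exclude(purpose, choice):
    """True if the observation must be discarded from the sample."""
    # "*" acts as "and", "+" acts as "or"; the sum can reach 2,
    # hence the comparison with 0 instead of a test of equality with 1.
    return ((purpose != 1) * (purpose != 3) + (choice == 0)) > 0


# Made-up (PURPOSE, CHOICE) pairs.
rows = [(1, 2), (3, 0), (2, 1), (2, 0)]
kept = [row for row in rows if not exclude(*row)]
print(kept)
```

Only the first row is kept: the second has CHOICE equal to 0, the third has a purpose different from both 1 and 3, and the fourth satisfies both conditions, so that the sum is 2.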
The variable PARAMETERS is used to define various parameters controlling the configuration of PythonBiogeme. In this example, we have chosen the optimization algorithm BIO using the following syntax.
BIOGEME_OBJECT.PARAMETERS['optimizationAlgorithm'] = "BIO"
The variable FORMULAS is used to select the parts of the model specification that are reported in the report file. In general, the formula of the log likelihood function is too complicated to be readable, and it is preferred to report only the specification of the utility functions, as in this example.
BIOGEME_OBJECT.FORMULAS['Train utility'] = V1
BIOGEME_OBJECT.FORMULAS['Swissmetro utility'] = V2
BIOGEME_OBJECT.FORMULAS['Car utility'] = V3
Finally, we request PythonBiogeme to calculate some statistics about the null likelihood, the log likelihood of a model with constants only, and statistics about the availability of the alternatives.
nullLoglikelihood(av,'obsIter')
choiceSet = [1,2,3]
cteLoglikelihood(choiceSet,CHOICE,'obsIter')
availabilityStatistics(av,'obsIter')
The function nullLoglikelihood computes the null log likelihood from the sample and asks PythonBiogeme to include it in the output file. The first argument is a dictionary mapping each alternative ID to its availability condition. The second is an iterator on the data file. The result is the log likelihood of a model where the choice probability for observation $n$ is given by $1/J_n$, where $J_n$ is the number of available alternatives, i.e.
$$\mathcal{L} = -\sum_n \log(J_n).$$
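As an illustration, the null log likelihood, that is, minus the sum over observations of the logarithm of the number of available alternatives, can be computed by hand in plain Python. The sketch below is not the nullLoglikelihood function of PythonBiogeme; the function name null_loglikelihood and the two observations are made up.

```python
import math


def null_loglikelihood(availabilities):
    """Log likelihood when each available alternative is equally likely.

    availabilities holds one dictionary per observation, mapping each
    alternative number to 1 (available) or 0 (not available).
    """
    return -sum(math.log(sum(av.values())) for av in availabilities)


# Made-up sample: all three alternatives, then only two of them.
sample = [{1: 1, 2: 1, 3: 1}, {1: 1, 2: 1, 3: 0}]
print(null_loglikelihood(sample))
```

For this sample the result is -(log 3 + log 2), since the first observation has three available alternatives and the second has two.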
The function cteLoglikelihood computes the constant log likelihood from the sample and asks PythonBiogeme to include it in the output file. It assumes that the full choice set is available for each observation. The first argument is a list containing the alternatives in the choice set. The second argument is the choice expression producing the id of the chosen alternative. The third argument is an iterator on the data file. The result is the log likelihood of a logit model where the only parameters are the alternative specific constants. If $n_j$ is the number of times alternative $j$ is chosen, then it is given by
$$\mathcal{L} = \sum_j n_j \ln(n_j) - n \ln(n),$$
where $n = \sum_j n_j$ is the total number of observations.
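The log likelihood of the model with constants only depends on the data through the number of times each alternative is chosen. A minimal sketch in plain Python, assuming as in the text that the full choice set is always available (the function name cte_loglikelihood and the list of choices are made up):

```python
import math
from collections import Counter


def cte_loglikelihood(choices):
    """Log likelihood of a logit model with constants only.

    With the full choice set always available, the fitted probability
    of alternative j is n_j / n, which gives the closed form below.
    """
    counts = Counter(choices)
    n = len(choices)
    return (sum(n_j * math.log(n_j) for n_j in counts.values())
            - n * math.log(n))


# Made-up chosen alternatives for six observations.
choices = [1, 2, 2, 3, 2, 1]
print(cte_loglikelihood(choices))
```

The closed form is the same quantity as the sum over alternatives of n_j log(n_j / n), which is how the test of this sketch verifies it.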
The function availabilityStatistics computes the number of times each alternative is declared available in the data set and asks PythonBiogeme to include it in the output file. The first argument is a dictionary containing for each alternative the expression for its availability. The second is an iterator on the data file. The result is a dictionary D with an entry D[i] for each alternative i containing the number of times it is available.
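The dictionary of availability counts can be mimicked in plain Python as follows. This is a sketch of the counting only, with a hypothetical function name and made-up observations; it does not reproduce how PythonBiogeme evaluates the availability expressions on the data file.

```python
def availability_statistics(availabilities):
    """Count, for each alternative, the observations where it is available."""
    counts = {}
    for av in availabilities:
        for alternative, available in av.items():
            # (available != 0) is True or False, added as 1 or 0
            counts[alternative] = counts.get(alternative, 0) + (available != 0)
    return counts


# Made-up sample of two observations.
sample = [{1: 1, 2: 1, 3: 1}, {1: 1, 2: 1, 3: 0}]
D = availability_statistics(sample)
print(D)
```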

4 Running PythonBiogeme

The estimation of the model is performed using the following command
pythonbiogeme 01logit swissmetro.dat
The following information is displayed during the execution.
  • Some information about the version of Biogeme.
This is biogeme (pythonbiogeme) 2.5
  • The name of the sample file that is read.
Read sample file: swissmetro.dat
  • PythonBiogeme is able to use several processors if they are available. By default, it uses half of the number of available processors on the computer.
Nbr of cores reported by the system: 4
Nbr of cores used by biogeme: 2
  • The details about the iterations of the estimation procedure are reported.
  • The value of the parameters at the end of the iterations.
Estimated parameters:
ASC_CAR = -0.154633
B_TIME = -1.27786
B_COST = -1.08379
ASC_SM = 0
ASC_TRAIN = -0.701187
The following files are generated by PythonBiogeme:
  • 01logit.html: the results of the estimation in HTML format. Its content is described in Section 5.
  • 01logit_param.py: the estimated value of the parameters, together with the variance-covariance matrix of the estimates, in a syntax that can be directly reused in a model specification file.
  • 01logit.log: a file containing messages produced by PythonBiogeme during the run.
  • 01logit.tex: a file containing the main results in LaTeX format. See Table 2.
  • hess.lis: contains the final and the second derivative, or Hessian, matrix. The format is such that it can be copied and pasted in a matrix language such as Matlab or Octave.
  • hessian.lis: contains the (opposite of the) Hessian matrix of the log likelihood function at each iteration, in a Matlab compatible format.
  • __parametersUsed.py: provides an exhaustive list of the parameters used by the run of PythonBiogeme, together with the value that has been used.
In order to avoid erasing previously generated results, the name of the files may vary from one run to the next. Therefore, PythonBiogeme explicitly mentions the name of the main files that have been generated.
File 01logit_param.py created
File 01logit.html has been generated
File 01logit.tex has been generated
Parameter                Coeff.     Robust asympt.
number     Description   estimate   std. error       t-stat   p-value
1          Car cte.      -0.155     0.0582            -2.66   0.01
2          Train cte.    -0.701     0.0826            -8.49   0.00
3          Travel cost   -1.08      0.0682           -15.89   0.00
4          Travel time   -1.28      0.104            -12.26   0.00

Summary statistics
Number of observations
Number of excluded observations
Number of estimated parameters
Table 2: Results of the estimation in LaTeX

5 PythonBiogeme: the report file

The report file generated by PythonBiogeme gathers various information about the result of the estimation. First, some information about the version of Biogeme and some links to relevant URLs are provided. Next, the names of the report file and of the sample file are reported.
If some formulas have been requested to be reported, it is done in the next section. It is followed by a list of the statistics requested in the model specification file. The estimation report follows, including
  • The number of parameters that have been estimated.
  • The number of observations, that is, the number of rows in the data file that have not been excluded.
  • The number of excluded observations.
  • Init log likelihood is the log likelihood $\mathcal{L}^i$ of the sample for the model defined with the default values of the parameters.
  • Final log likelihood is the log likelihood $\mathcal{L}^*$ of the sample for the estimated model.
  • Likelihood ratio test for the init. model is
    $$-2(\mathcal{L}^i - \mathcal{L}^*),$$
    where $\mathcal{L}^i$ is the log likelihood of the init model as defined above, and $\mathcal{L}^*$ is the log likelihood of the sample for the estimated model.
  • Rho-square is
    $$\rho^2 = 1 - \frac{\mathcal{L}^*}{\mathcal{L}^i}.$$
  • Rho-square-bar is
    $$\bar{\rho}^2 = 1 - \frac{\mathcal{L}^* - K}{\mathcal{L}^i},$$
    where $K$ is the number of estimated parameters. Note that this statistic is meaningless in the presence of constraints, where the number of degrees of freedom is less than the number of parameters.
  • Final gradient norm is the norm of the gradient of the log likelihood function computed for the estimated parameters. If no constraint is active at the solution, it should be close to 0. If there are equality constraints, or if some bound constraints or inequality constraints are active at the solution (that is, they are verified with equality), the gradient may not be close to zero.
  • Diagnostic is the diagnostic reported by the optimization algorithm. If the algorithm has not converged, the estimation results presented in the file cannot be used as such.
  • Iterations is the number of iterations used by the algorithm before it stopped.
  • Run time is the actual time used by the algorithm before it stopped, in minutes and seconds.
  • Nbr of thread: number of threads, that is, of processors, used during the estimation.
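The goodness-of-fit statistics of the report can be recomputed from the initial and final log likelihood values. The sketch below uses the usual definitions: likelihood ratio -2(L_init - L_final), rho-square 1 - L_final / L_init, and rho-square-bar 1 - (L_final - K) / L_init. The function name and the numerical values are made up for illustration and are not results of the estimation described here.

```python
def goodness_of_fit(init_loglike, final_loglike, n_parameters):
    """Likelihood ratio test statistic, rho-square and rho-square-bar."""
    likelihood_ratio = -2.0 * (init_loglike - final_loglike)
    rho_square = 1.0 - final_loglike / init_loglike
    rho_square_bar = 1.0 - (final_loglike - n_parameters) / init_loglike
    return likelihood_ratio, rho_square, rho_square_bar


# Made-up log likelihood values, with 4 estimated parameters.
lr, rho2, rho2_bar = goodness_of_fit(-1000.0, -800.0, 4)
print(lr, rho2, rho2_bar)
```

Rho-square-bar is always lower than rho-square, as it penalizes the final log likelihood by the number of estimated parameters.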
The following section reports the estimates of the parameters of the utility function, together with some statistics. For each parameter $\beta_k$, the following is reported:
  • The name of the parameter.
  • The estimated value $\beta_k$.
  • The standard error $\sigma_k$ of the estimate, calculated as the square root of the $k$th diagonal entry of the Rao-Cramer bound (see Appendix B).
  • The $t$ statistics, calculated as $t_k = \beta_k / \sigma_k$.
  • The $p$ value, calculated as $2(1 - \Phi(|t_k|))$, where $\Phi(\cdot)$ is the cumulative distribution function of the univariate standard normal distribution.
  • A sign is appended if the absolute value of $t_k$ is less than 1.96, emphasizing a potential lack of statistical significance. In this example, no such sign appears.
  • The robust standard error $\sigma^R_k$ of the estimate, calculated as the square root of the $k$th diagonal entry of the robust estimate of the variance covariance matrix (see Appendix B).
  • The robust $t$ statistics, calculated as $t^R_k = \beta_k / \sigma^R_k$.
  • The robust $p$ value, calculated as $2(1 - \Phi(|t^R_k|))$, where $\Phi(\cdot)$ is the cumulative distribution function of the univariate standard normal distribution.
  • A sign is appended if the absolute value of $t^R_k$ is less than 1.96, emphasizing a potential lack of statistical significance. In this example, no such sign appears.
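The t statistic and the p value of a parameter can be recomputed from its estimate and its standard error. The following sketch uses only the Python standard library, with the normal cumulative distribution function written in terms of math.erf; the function names are hypothetical, and the input values are the rounded figures reported for the car constant in Table 2.

```python
import math


def standard_normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def t_statistic(estimate, std_error):
    """Ratio of the estimate to its standard error."""
    return estimate / std_error


def p_value(t):
    """Two-sided p value of a t statistic under the normal distribution."""
    return 2.0 * (1.0 - standard_normal_cdf(abs(t)))


# Rounded estimate and robust standard error of the car constant (Table 2).
t = t_statistic(-0.155, 0.0582)
print(round(t, 2), round(p_value(t), 2))
```

Up to rounding, this reproduces the t statistic and the p value reported for that parameter. The threshold 1.96 used for the appended sign corresponds to a p value of about 0.05.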
The last section reports, for each pair of parameters $k$ and $\ell$,
  • the name of $\beta_k$,
  • the name of $\beta_\ell$,
  • the entry $\Sigma_{k,\ell}$ of the Rao-Cramer bound (see Appendix B),
  • the correlation between $\beta_k$ and $\beta_\ell$, calculated as $\Sigma_{k,\ell} / \sqrt{\Sigma_{k,k} \Sigma_{\ell,\ell}}$,
  • the $t$ statistics, calculated as $t_{k,\ell} = (\beta_k - \beta_\ell) / \sqrt{\Sigma_{k,k} + \Sigma_{\ell,\ell} - 2 \Sigma_{k,\ell}}$,
  • a sign is appended if the absolute value of $t_{k,\ell}$ is less than 1.96, emphasizing that the hypothesis that the two parameters are equal cannot be rejected at the 5% level (in this example, no such sign appears),
  • the entry $\Sigma^R_{k,\ell}$ of $\Sigma^R$, the robust estimate of the variance covariance matrix (see Appendix B),
  • the robust correlation between $\beta_k$ and $\beta_\ell$, calculated as $\Sigma^R_{k,\ell} / \sqrt{\Sigma^R_{k,k} \Sigma^R_{\ell,\ell}}$,
  • the robust $t$ statistics, calculated as $t^R_{k,\ell} = (\beta_k - \beta_\ell) / \sqrt{\Sigma^R_{k,k} + \Sigma^R_{\ell,\ell} - 2 \Sigma^R_{k,\ell}}$,
  • a sign is appended if the absolute value of $t^R_{k,\ell}$ is less than 1.96, emphasizing that the hypothesis that the two parameters are equal cannot be rejected at the 5% level (in this example, one such sign appears, for parameters B_COST and B_TIME).
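The pairwise statistics follow from the variance-covariance matrix of the estimates: the correlation divides the covariance by the product of the two standard errors, and the t test of equality divides the difference of the estimates by the standard error of that difference. A minimal sketch in plain Python, with hypothetical function names and a made-up matrix that is not the one estimated in this example:

```python
import math


def correlation(cov, k, l):
    """Correlation between estimates k and l from a covariance matrix."""
    return cov[k][l] / math.sqrt(cov[k][k] * cov[l][l])


def t_test_equality(beta, cov, k, l):
    """t statistic for the hypothesis that parameters k and l are equal."""
    variance = cov[k][k] + cov[l][l] - 2.0 * cov[k][l]
    return (beta[k] - beta[l]) / math.sqrt(variance)


# Made-up estimates and variance-covariance matrix for two parameters.
beta = [-1.0, -1.3]
cov = [[0.04, 0.01],
       [0.01, 0.09]]
print(correlation(cov, 0, 1))
print(t_test_equality(beta, cov, 0, 1))
```

With these made-up values, the absolute value of the t statistic is below 1.96, so the hypothesis that the two parameters are equal would not be rejected at the 5% level.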
The final line reports the value of the smallest singular value of the second derivatives matrix. A value close to zero is a sign of singularity, that may be due to a lack of variation in the data or an unidentified model.

A Complete specification file

A.1 01logit.py
###############################################################################
#
# @file 01logit.py
# @author: Michel Bierlaire, EPFL
# @date: Wed Dec 21 13:23:27 2011
#
# Logit model
# Three alternatives: Train, Car and Swissmetro
# SP data
#
###############################################################################
from biogeme import *
from headers import *
from statistics import *
#Parameters to be estimated
# Arguments:
# - 1 Name for report; Typically, the same as the variable.
# - 2 Starting value.
# - 3 Lower bound.
# - 4 Upper bound.
# - 5 0: estimate the parameter, 1: keep it fixed.
#
ASC_CAR = Beta('ASC_CAR',0,-1000,1000,0,'Car cte.')
ASC_TRAIN = Beta('ASC_TRAIN', 0, -1000,1000,0,'Train cte.')
ASC_SM = Beta('ASC_SM',0,-1000,1000,1,'Swissmetro cte.')
B_TIME = Beta('B_TIME',0,-1000,1000,0,'Travel time')
B_COST = Beta('B_COST',0,-1000,1000,0,'Travel cost')
# Utility functions
#If the person has a GA (season ticket) her incremental cost
#is actually 0 rather than the cost value gathered from the
# network data.
SM_COST = SM_CO * ( GA == 0 )
TRAIN_COST = TRAIN_CO * ( GA == 0 )

# For numerical reasons, it is good practice to scale the data so
# that the values of the parameters are around 1.0.
# A previous estimation with the unscaled data has generated
# parameters around -0.01 for both cost and time. Therefore, time
# and cost are divided by 100.
# The following statements are designed to preprocess the data.
# It is like creating new columns in the data file. This should
# be preferred to a statement like
# TRAIN_TT_SCALED = TRAIN_TT / 100.0
# which causes the division to be reevaluated again and again,
# through the iterations. For models taking a long time to
# estimate, it may make a significant difference.
TRAIN_TT_SCALED = DefineVariable('TRAIN_TT_SCALED',\
        TRAIN_TT / 100.0)
TRAIN_COST_SCALED = DefineVariable('TRAIN_COST_SCALED',\
        TRAIN_COST / 100)
SM_TT_SCALED = DefineVariable('SM_TT_SCALED', SM_TT / 100.0)
SM_COST_SCALED = DefineVariable('SM_COST_SCALED', SM_COST / 100)
CAR_TT_SCALED = DefineVariable('CAR_TT_SCALED', CAR_TT / 100)
CAR_CO_SCALED = DefineVariable('CAR_CO_SCALED', CAR_CO / 100)
V1 = ASC_TRAIN + \
        B_TIME * TRAIN_TT_SCALED + \
        B_COST * TRAIN_COST_SCALED
V2 = ASC_SM + \
        B_TIME * SM_TT_SCALED + \
        B_COST * SM_COST_SCALED
V3 = ASC_CAR + \
        B_TIME * CAR_TT_SCALED + \
        B_COST * CAR_CO_SCALED
# Associate utility functions with the numbering of alternatives
V = {1: V1,
        2: V2,
        3: V3}
# Associate the availability conditions with the alternatives
CAR_AV_SP = DefineVariable('CAR_AV_SP', CAR_AV * ( SP != 0 ))
TRAIN_AV_SP = DefineVariable('TRAIN_AV_SP', TRAIN_AV * ( SP != 0 ))
av = {1: TRAIN_AV_SP,
        2: SM_AV,
        3: CAR_AV_SP}
# The choice model is a logit, with availability conditions
logprob = bioLogLogit(V,av,CHOICE)
# Defines an iterator on the data
rowIterator('obsIter')
# Define the likelihood function for the estimation
BIOGEME_OBJECT.ESTIMATE = Sum(logprob,'obsIter')

# All observations verifying the following expression will not be
# considered for estimation
# The modeler here has developed the model only for work trips.
# Observations such that the dependent variable CHOICE is 0
# are also removed.
exclude = (( PURPOSE != 1 ) * ( PURPOSE != 3 ) + \
    ( CHOICE == 0 )) > 0
BIOGEME_OBJECT.EXCLUDE = exclude
BIOGEME_OBJECT.PARAMETERS['optimizationAlgorithm'] = "IPOPT"
BIOGEME_OBJECT.PARAMETERS['biogemeDisplay'] = "3"
BIOGEME_OBJECT.FORMULAS['Train utility'] = V1
BIOGEME_OBJECT.FORMULAS['Swissmetro utility'] = V2
BIOGEME_OBJECT.FORMULAS['Car utility'] = V3
# Statistics
nullLoglikelihood(av,'obsIter')
choiceSet = [1,2,3]
cteLoglikelihood(choiceSet,CHOICE,'obsIter')
availabilityStatistics(av,'obsIter')
#BIOGEME_OBJECT.PARAMETERS['printGradient'] = "1"
#BIOGEME_OBJECT.PARAMETERS['forceScientificNotation'] = "0"
#BIOGEME_OBJECT.PARAMETERS['precisionParameters'] = "3"
#BIOGEME_OBJECT.PARAMETERS['precisionStatistics'] = "3"
#BIOGEME_OBJECT.PARAMETERS['precisionTStats'] = "14"
#BIOGEME_OBJECT.PARAMETERS['bootstrapStdErr'] = "100"
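
To make the role of the availability conditions in the specification above more concrete, the following is a minimal NumPy sketch of the quantity that a logit model with availability conditions assigns to each observation: the log of the probability of the chosen alternative, where unavailable alternatives are excluded from the choice set. It is not Biogeme's implementation. The function name log_logit and the toy arrays are hypothetical, and the sketch assumes that every observation has at least one available alternative.

```python
import numpy as np

def log_logit(V, av, choice):
    """Log probability of the chosen alternative under a logit model.

    V      : dict {alternative id: array of utilities, one per observation}
    av     : dict {alternative id: array of 0/1 availability indicators}
    choice : array with the id of the chosen alternative, one per observation
    """
    alts = sorted(V)
    util = np.column_stack([np.asarray(V[a], dtype=float) for a in alts])
    avail = np.column_stack([np.asarray(av[a], dtype=bool) for a in alts])
    # An unavailable alternative gets utility -inf, hence probability zero.
    masked = np.where(avail, util, -np.inf)
    # Log-sum-exp, subtracting the row maximum for numerical stability.
    vmax = masked.max(axis=1, keepdims=True)
    logsum = vmax[:, 0] + np.log(np.exp(masked - vmax).sum(axis=1))
    # Pick the utility of the chosen alternative in each row.
    col = np.array([alts.index(c) for c in choice])
    chosen = masked[np.arange(len(col)), col]
    return chosen - logsum

# Toy data (hypothetical): equal utilities, alternative 3 is
# unavailable for the second observation.
V = {1: np.zeros(2), 2: np.zeros(2), 3: np.zeros(2)}
av = {1: np.array([1, 1]), 2: np.array([1, 1]), 3: np.array([1, 0])}
choice = np.array([1, 2])
lp = log_logit(V, av, choice)
```

With equal utilities, the first observation has three available alternatives and the second only two, so the chosen alternative has probability 1/3 and 1/2 respectively. Summing such values over the observations gives the sample log likelihood.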

B Estimation of the variance-covariance matrix

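Under relatively general conditions, the asymptotic variance-covariance matrix of the maximum likelihood estimates of the vector of parameters $\theta \in \mathbb{R}^K$ is given by the Cramer-Rao bound
$$-E\left[\nabla^2 \mathcal{L}(\theta)\right]^{-1} = \left\{-E\left[\frac{\partial^2 \mathcal{L}(\theta)}{\partial\theta\,\partial\theta^T}\right]\right\}^{-1}. \tag{12}$$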
The term in square brackets is the matrix of the second derivatives of the log likelihood function $\mathcal{L}$ with respect to the parameters, evaluated at the true parameters. Thus the entry in the $k$th row and the $\ell$th column is
$$\frac{\partial^2 \mathcal{L}(\theta)}{\partial\theta_k\,\partial\theta_\ell}. \tag{13}$$
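Since we do not know the actual values of the parameters at which to evaluate the second derivatives, or the distribution of $x_{in}$ and $x_{jn}$ over which to take their expected value, we estimate the variance-covariance matrix by evaluating the second derivatives at the estimated parameters $\hat{\theta}$ and the sample distribution of $x_{in}$ and $x_{jn}$ instead of their true distribution. Thus we use
$$E\left[\frac{\partial^2 \mathcal{L}(\theta)}{\partial\theta_k\,\partial\theta_\ell}\right] \approx \sum_{n=1}^{N} \left[\frac{\partial^2 \left(y_{in} \ln P_n(i) + y_{jn} \ln P_n(j)\right)}{\partial\theta_k\,\partial\theta_\ell}\right]_{\theta=\hat{\theta}} \tag{14}$$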
as a consistent estimator of the matrix of second derivatives.
Denote this matrix as $\hat{A}$. Note that, from the second order optimality conditions of the optimization problem, this matrix is negative semi-definite, which is the algebraic equivalent of the local concavity of the log likelihood function. If the maximum is unique, the matrix is negative definite, and the function is locally strictly concave.
An estimate of the Cramer-Rao bound (12) is given by
$$\hat{\Sigma}^{CR}_{\theta} = -\hat{A}^{-1}. \tag{15}$$
If the matrix $\hat{A}$ is negative definite then $-\hat{A}$ is invertible and the Cramer-Rao bound is positive definite.
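Another consistent estimator of the (negative of the) second derivatives matrix can be obtained by the matrix of the cross-products of first derivatives as follows:
$$-E\left[\frac{\partial^2 \mathcal{L}(\theta)}{\partial\theta\,\partial\theta^T}\right] \approx \sum_{n=1}^{N} \nabla\ell_n(\hat{\theta})\,\nabla\ell_n(\hat{\theta})^T = \hat{B}, \tag{16}$$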
where
$$\nabla\ell_n(\hat{\theta}) = \frac{\partial}{\partial\theta} \ln P(i_n \mid \mathcal{C}_n; \hat{\theta}) \tag{17}$$
is the gradient vector of the log likelihood of observation $n$. This approximation is employed by the BHHH algorithm, from the work by Berndt et al. (1974). Therefore, an estimate of the variance-covariance matrix is given by
$$\hat{\Sigma}^{BHHH}_{\theta} = \hat{B}^{-1}, \tag{18}$$
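although it is rarely used. Instead, $\hat{B}$ is used to derive a third consistent estimator of the variance-covariance matrix of the parameters, defined as
$$\hat{\Sigma}^{R}_{\theta} = (-\hat{A})^{-1}\,\hat{B}\,(-\hat{A})^{-1} = \hat{\Sigma}^{CR}_{\theta}\,\left(\hat{\Sigma}^{BHHH}_{\theta}\right)^{-1}\,\hat{\Sigma}^{CR}_{\theta}. \tag{19}$$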
It is called the robust estimator, or sometimes the sandwich estimator, due to the form of equation (19). Biogeme reports statistics based on both the Cramer-Rao estimate (15) and the robust estimate (19).
When the true likelihood function is maximized, these estimators are asymptotically equivalent, and the Cramer-Rao bound should be preferred (Kauermann and Carroll, 2001). When other consistent estimators are used, the robust estimator must be used (White, 1982). Consistent non-maximum likelihood estimators, known as pseudo maximum likelihood estimators, are often used when the true likelihood function is unknown or difficult to compute. In such cases, it is often possible to obtain consistent estimators by maximizing an objective function based on a simplified probability distribution.
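
The three estimators discussed in this appendix (Cramer-Rao, BHHH and robust) can be illustrated with a minimal NumPy sketch. It is not Biogeme's implementation: it assumes a binary logit model estimated on synthetic data by Newton's method, for which the per-observation gradients and the matrix of second derivatives are available in closed form. The function name covariance_estimates and all variable names are hypothetical.

```python
import numpy as np

def covariance_estimates(scores, hessian):
    """Return the Cramer-Rao, BHHH and robust covariance estimates.

    scores  : (N, K) array; row n is the gradient of the log likelihood
              of observation n, evaluated at the estimates.
    hessian : (K, K) matrix of second derivatives of the sample log
              likelihood, evaluated at the estimates.
    """
    # Sum over observations of the outer products of the gradients.
    bhhh_matrix = scores.T @ scores
    # Inverse of the negative of the second derivatives matrix.
    cramer_rao = np.linalg.inv(-hessian)
    bhhh = np.linalg.inv(bhhh_matrix)
    # "Sandwich" form of the robust estimator.
    robust = cramer_rao @ bhhh_matrix @ cramer_rao
    return cramer_rao, bhhh, robust

# Synthetic binary logit data, for illustration only.
rng = np.random.default_rng(0)
N = 500
x = np.column_stack([np.ones(N), rng.normal(size=N)])
true_theta = np.array([0.5, -1.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-x @ true_theta))).astype(float)

# Maximum likelihood estimation by Newton's method.
theta = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-x @ theta))
    gradient = x.T @ (y - p)
    hessian = -(x * (p * (1.0 - p))[:, None]).T @ x
    theta = theta - np.linalg.solve(hessian, gradient)

# Derivatives re-evaluated at the final estimates.
p = 1.0 / (1.0 + np.exp(-x @ theta))
scores = x * (y - p)[:, None]
hessian = -(x * (p * (1.0 - p))[:, None]).T @ x

cramer_rao, bhhh, robust = covariance_estimates(scores, hessian)
std_err = np.sqrt(np.diag(cramer_rao))
robust_std_err = np.sqrt(np.diag(robust))
```

Because the synthetic data are generated from the same logit model that is estimated, the two sets of standard errors should be close in this example. Larger differences between them are typically a sign that the model is misspecified.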

References

Berndt, E. K., Hall, B. H., Hall, R. E. and Hausman, J. A. (1974). Estimation and inference in nonlinear structural models, Annals of Economic and Social Measurement 3/4: 653-665.
Kauermann, G. and Carroll, R. (2001). A note on the efficiency of sandwich covariance matrix estimation, Journal of the American Statistical Association 96(456).
White, H. (1982). Maximum likelihood estimation of misspecified models, Econometrica 50: 1-25.