Bug #695

Connection to Spread Daemon Fails

Added by S. Wrede almost 10 years ago. Updated almost 10 years ago.

Status: Resolved
Start date: 11/01/2011
Priority: Normal
Due date:
Assignee: S. Wrede
% Done:

Category: Spread Connector
Target version: -


Environment: Mac OS X Lion, Compiler: clang / LLVM 3.0

Every second connection to the Spread daemon fails. This behavior can be reproduced by starting an rsb_informer with Spread transport enabled twice. The second executable hangs in the activate method of the Spread connector as suggested by the following backtrace:

#0  0x00007fff90b1ad78 in recvfrom ()
#1  0x00000001007d54f0 in recv_nointr_timeout () at sp.c:307
#2  0x00000001007d4468 in SP_connect_timeout (spread_name=<value temporarily unavailable, due to optimizations>, private_name=<value temporarily unavailable, due to optimizations>, priority=Cannot access memory at address 0x1
) at sp.c:744
#3  0x00000001007d3f6a in SP_connect (spread_name=<value temporarily unavailable, due to optimizations>, private_name=<value temporarily unavailable, due to optimizations>, priority=<value temporarily unavailable, due to optimizations>, group_membership=<value temporarily unavailable, due to optimizations>, mbox=<value temporarily unavailable, due to optimizations>, private_group=<value temporarily unavailable, due to optimizations>) at sp.c:551
#4  0x0000000100172216 in rsb::spread::SpreadConnection::activate (this=0x100e01ef0) at SpreadConnection.cpp:78
#5  0x000000010017b111 in rsb::spread::SpreadConnector::activate (this=0x100e01b80) at SpreadConnector.cpp:69
#6  0x000000010015b1df in rsb::spread::OutConnector::activate (this=0x100e019a0) at OutConnector.cpp:74
#7  0x00000001000ffec6 in rsb::eventprocessing::OutRouteConfigurator::activate (this=0x100e062e0) at OutRouteConfigurator.cpp:81
#8  0x000000010006e83b in rsb::InformerBase::InformerBase (this=0x100e01950, __vtt_parm=0x10000a748, connectors=@0x7fff5fbff878, scope=@0x7fff5fbffac8, config=@0x7fff5fbff9a0, defaultType=@0x7fff5fbff990) at Informer.cpp:44
#9  0x0000000100006d16 in rsb::Informer<std::string>::Informer (this=0x100e01950, connectors=@0x7fff5fbff878, scope=@0x7fff5fbffac8, config=@0x7fff5fbff9a0, type=@0x7fff5fbff990) at Informer.h:251
#10 0x0000000100007556 in rsb::Factory::createInformer<std::string> (this=0x100e04ae0, scope=@0x7fff5fbffac8, config=@0x7fff5fbff9a0, dataType=@0x7fff5fbff990) at Factory.h:80
#11 0x0000000100004522 in main () at informer.cpp:45

This seems to be related to abnormal behavior of the Spread daemon itself, as the Spread tools cannot connect to the daemon either:

localhost:~ swrede$ /vol/cit/bin/spuser 
Spread library version is 4.1.0
recv_nointr_timeout: Timed out
SP_error: (-8) Connection closed by spread


Related issues

Related to Robotics Service Bus - Tasks #528: Add a "Fixing the Network" Wiki page (also documenting Sp... Closed 08/31/2011


#1 Updated by J. Wienke almost 10 years ago

Then this is clearly an upstream error. What should we do about this?

#2 Updated by S. Wrede almost 10 years ago

  • Status changed from New to Resolved
  • Assignee set to S. Wrede
  • % Done changed from 0 to 100

Actually, I found the problem. It is not really an upstream bug in Spread but a configuration error. However, it is one of the typical Spread configuration errors that is neither reported nor explicitly checked for at startup. Maybe it is not possible for the Spread guys to check for this kind of error condition, but I would still think it is. BTW: I am pretty sure this is the same effect we had on Dimitris laptop at the code camp. The following describes what happened, for later reference in case somebody experiences similar problems:

  1. I used the same Spread source package for installing Spread as we use for the Ubuntu / Debian packages. In these packages, I fixed the spread.conf installed by default to use the IP for localhost that is correct on Ubuntu Linux. However, this IP is not correct on Mac OS and probably not on other OSes either.
  2. I started the Spread daemon as usual with spread -n localhost and got the following output:
    localhost:~ swrede$ /vol/cit/sbin/spread -n localhost
    | The Spread Toolkit.                                                       |
    | Copyright (c) 1993-2009 Spread Concepts LLC                               |
    | All rights reserved.                                                      |
    |                                                                           |
    | The Spread toolkit is licensed under the Spread Open-Source License.      |
    | You may only use this software in compliance with the License.            |
    | A copy of the license can be found at http://www.spread.org/license       |
    |                                                                           |
    | This product uses software developed by Spread Concepts LLC for use       |
    | in the Spread toolkit. For more information about Spread,                 |
    | see http://www.spread.org                                                 |
    |                                                                           |
    | This software is distributed on an "AS IS" basis, WITHOUT WARRANTY OF     |
    | ANY KIND, either express or implied.                                      |
    |                                                                           |
    | Creators:                                                                 |
    |    Yair Amir             yairamir@cs.jhu.edu                              |
    |    Michal Miskin-Amir    michal@spreadconcepts.com                        |
    |    Jonathan Stanton      jstanton@gwu.edu                                 |
    |    John Schultz          jschultz@spreadconcepts.com                      |
    |                                                                           |
    | Major Contributors:                                                       |
    |    Ryan Caudy           rcaudy@gmail.com - contribution to process groups.|
    |    Claudiu Danilov      claudiu@acm.org - scalable, wide-area support.    |
    |    Cristina Nita-Rotaru crisn@cs.purdue.edu - GC security.                |
    |    Theo Schlossnagle    jesus@omniti.com - Perl, autoconf, old skiplist.  |
    |    Dan Schoenblum       dansch@cnds.jhu.edu - Java interface.             |
    |                                                                           |
    | Special thanks to the following for discussions and ideas:                |
    |    Ken Birman, Danny Dolev, Jacob Green, Mike Goodrich, Ben Laurie,       |
    |    David Shaw, Gene Tsudik, Robbert VanRenesse.                           |
    |                                                                           |
    | Partial funding provided by the Defense Advanced Research Project Agency  |
    | (DARPA) and the National Security Agency (NSA) 2000-2004. The Spread      |
    | toolkit is not necessarily endorsed by DARPA or the NSA.                  |
    |                                                                           |
    | For a full list of contributors, see Readme.txt in the distribution.      |
    |                                                                           |
    | WWW:     www.spread.org     www.spreadconcepts.com                        |
    | Contact: info@spreadconcepts.com                                          |
    |                                                                           |
    | Version 4.01.00 Built 18/June/2009                                        |
    Conf_load_conf_file: using file: /vol/cit/etc/spread.conf
    Successfully configured Segment 0 [] with 1 procs:
    Finished configuration file.
    Hash value for this configuration is: 2748807275
    Conf_load_conf_file: My name: localhost, id:, port: 4803
  3. After this (without looking any closer at the output), I started the RSB executables, which exhibited the behavior described in the issue. This also means that each first connection to Spread worked fine.

However, what I missed is that the Spread daemon output lacks the confirmation about the Spread daemons it found. Narf. After changing the installed Spread configuration ($prefix/etc/spread.conf) to use the IP for localhost that is correct on Lion, the Spread daemon prints the missing lines to the console:

Successfully configured Segment 0 [] with 1 procs:
Finished configuration file.
Hash value for this configuration is: 1889207152
Conf_load_conf_file: My name: localhost, id:, port: 4803
Membership id is ( 2130706433, 1320161499)
Configuration at localhost is:
Num Segments 1
    1       4803
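
The two numbers in the "Membership id is" line above can be made readable. The following is a minimal sketch, assuming the first value is the daemon's IPv4 address packed into a 32-bit integer and the second is a Unix timestamp; neither interpretation is stated in this report. The function name is hypothetical.

```python
import ipaddress
from datetime import datetime, timezone


def decode_membership_id(proc_id, stamp):
    """Interpret a Spread membership id pair as (IPv4 address, UTC time).

    Assumption: proc_id is an IPv4 address packed into a 32-bit integer
    and stamp is seconds since the Unix epoch.
    """
    address = ipaddress.IPv4Address(proc_id)
    started = datetime.fromtimestamp(stamp, tz=timezone.utc)
    return str(address), started.isoformat()


# Values copied from the daemon output in this report.
print(decode_membership_id(2130706433, 1320161499))
# → ('127.0.0.1', '2011-11-01T15:31:39+00:00')
```

Under these assumptions the pair decodes to a loopback address and a time on the start date of this issue, which fits a daemon that has bound to the local host.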

After this configuration change, everything works fine again. So, basically, the puzzling parts were the "Successfully configured" line and the fact that client applications could successfully connect to the Spread daemon a single time.
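
One plausible way to catch this kind of mismatch before starting the daemon is to compare the address configured for each host with what the name resolves to locally. The sketch below assumes spread.conf lists hosts as simple `name address` pairs inside `Spread_Segment ... { }` blocks with `#` comments, as in the sample configuration shipped with Spread; entries with nested interface blocks are not handled. All function names are hypothetical, and the addresses are documentation placeholders, not the values from this report. Whether Spread performs an equivalent lookup internally is not confirmed here.

```python
import re
import socket

# Matches "Spread_Segment <address>[:port] { ... }" and captures the body.
SEGMENT_RE = re.compile(r"Spread_Segment\s+\S+\s*\{(.*?)\}", re.DOTALL)


def configured_hosts(conf_text):
    """Return {name: address} for simple 'name address' entries in all segments."""
    # Drop '#' comments first so commented-out sample segments are ignored.
    uncommented = "\n".join(
        line.split("#", 1)[0] for line in conf_text.splitlines()
    )
    hosts = {}
    for body in SEGMENT_RE.findall(uncommented):
        for line in body.splitlines():
            parts = line.split()
            if len(parts) == 2:
                hosts[parts[0]] = parts[1]
    return hosts


def mismatched_hosts(conf_text, resolve=socket.gethostbyname):
    """Return {name: (configured, resolved)} where the resolver disagrees."""
    mismatches = {}
    for name, configured in configured_hosts(conf_text).items():
        try:
            resolved = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            resolved = None
        if resolved != configured:
            mismatches[name] = (configured, resolved)
    return mismatches


# Placeholder configuration; addresses are from a documentation range.
EXAMPLE_CONF = """
Spread_Segment 192.0.2.255:4803 {
    localhost 192.0.2.10
}
"""

print(configured_hosts(EXAMPLE_CONF))
```

Calling `mismatched_hosts(open(path).read())` on the installed file would then list every host whose configured address differs from the local resolution; an empty result means the two agree.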

We should probably add a note to a Spread-specific transport page saying that one should look closely at the output of the Spread daemon and, in particular, check for the "Configuration at..." lines.
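
The check suggested above could also be automated. A minimal sketch, assuming the daemon's console output has been captured as text; the function name is hypothetical, and the two sample strings are the relevant lines from the outputs quoted in this report.

```python
def missing_confirmations(daemon_output):
    """Return the expected line prefixes that never appear in the output."""
    expected = ("Membership id is", "Configuration at")
    lines = [line.strip() for line in daemon_output.splitlines()]
    return [
        prefix
        for prefix in expected
        if not any(line.startswith(prefix) for line in lines)
    ]


# Output of the misconfigured daemon (stops after the "My name" line).
BROKEN = """\
Successfully configured Segment 0 [] with 1 procs:
Finished configuration file.
Hash value for this configuration is: 2748807275
Conf_load_conf_file: My name: localhost, id:, port: 4803
"""

# Output after correcting spread.conf.
WORKING = """\
Successfully configured Segment 0 [] with 1 procs:
Finished configuration file.
Hash value for this configuration is: 1889207152
Conf_load_conf_file: My name: localhost, id:, port: 4803
Membership id is ( 2130706433, 1320161499)
Configuration at localhost is:
Num Segments 1
    1       4803
"""

print(missing_confirmations(BROKEN))   # → ['Membership id is', 'Configuration at']
print(missing_confirmations(WORKING))  # → []
```

A non-empty result indicates that the daemon started but did not report a working configuration, which is the situation described in this issue.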
