Connection to Spread Daemon Fails
Assignee: S. Wrede
Environment: Mac OS X Lion, Compiler: clang / LLVM 3.0
Every second connection to the Spread daemon fails. This behavior can be reproduced by starting rsb_informer twice with the Spread transport enabled. The second process hangs in the activate method of the Spread connector, as the following backtrace shows:
#0 0x00007fff90b1ad78 in recvfrom ()
#1 0x00000001007d54f0 in recv_nointr_timeout () at sp.c:307
#2 0x00000001007d4468 in SP_connect_timeout (spread_name=<value temporarily unavailable, due to optimizations>, private_name=<value temporarily unavailable, due to optimizations>, priority=Cannot access memory at address 0x1 ) at sp.c:744
#3 0x00000001007d3f6a in SP_connect (spread_name=<value temporarily unavailable, due to optimizations>, private_name=<value temporarily unavailable, due to optimizations>, priority=<value temporarily unavailable, due to optimizations>, group_membership=<value temporarily unavailable, due to optimizations>, mbox=<value temporarily unavailable, due to optimizations>, private_group=<value temporarily unavailable, due to optimizations>) at sp.c:551
#4 0x0000000100172216 in rsb::spread::SpreadConnection::activate (this=0x100e01ef0) at SpreadConnection.cpp:78
#5 0x000000010017b111 in rsb::spread::SpreadConnector::activate (this=0x100e01b80) at SpreadConnector.cpp:69
#6 0x000000010015b1df in rsb::spread::OutConnector::activate (this=0x100e019a0) at OutConnector.cpp:74
#7 0x00000001000ffec6 in rsb::eventprocessing::OutRouteConfigurator::activate (this=0x100e062e0) at OutRouteConfigurator.cpp:81
#8 0x000000010006e83b in rsb::InformerBase::InformerBase (this=0x100e01950, __vtt_parm=0x10000a748, connectors=@0x7fff5fbff878, scope=@0x7fff5fbffac8, config=@0x7fff5fbff9a0, defaultType=@0x7fff5fbff990) at Informer.cpp:44
#9 0x0000000100006d16 in rsb::Informer<std::string>::Informer (this=0x100e01950, connectors=@0x7fff5fbff878, scope=@0x7fff5fbffac8, config=@0x7fff5fbff9a0, type=@0x7fff5fbff990) at Informer.h:251
#10 0x0000000100007556 in rsb::Factory::createInformer<std::string> (this=0x100e04ae0, scope=@0x7fff5fbffac8, config=@0x7fff5fbff9a0, dataType=@0x7fff5fbff990) at Factory.h:80
#11 0x0000000100004522 in main () at informer.cpp:45
This seems to be related to abnormal behavior of the Spread daemon, since the Spread tools cannot connect to the daemon either:
localhost:~ swrede$ /vol/cit/bin/spuser
Spread library version is 4.1.0
recv_nointr_timeout: Timed out
SP_error: (-8) Connection closed by spread
Bye.
#2 Updated by S. Wrede almost 10 years ago
- Status changed from New to Resolved
- Assignee set to S. Wrede
- % Done changed from 0 to 100
Actually, I found the problem. It is not really an upstream bug in Spread but a configuration error. However, it is one of the typical Spread configuration errors that are neither reported nor explicitly checked for at startup time. Maybe it is not possible for the Spread guys to check for this kind of error condition, but I would still think it is. BTW: I am pretty sure this is the same effect we had on Dimitris laptop at the code camp. The following describes what happened, for later reference in case somebody ever runs into similar problems:
- I used the same Spread source package for installing Spread as we use for the Ubuntu / Debian packages. In these packages, I changed the spread.conf installed by default to use 127.0.1.1 for localhost, which is correct on Ubuntu Linux. However, this is not correct on Mac OS and probably not on other operating systems either.
- I started the Spread daemon as usual with spread -n localhost and got the following output:
localhost:~ swrede$ /vol/cit/sbin/spread -n localhost
/===========================================================================\
| The Spread Toolkit. |
| Copyright (c) 1993-2009 Spread Concepts LLC |
| All rights reserved. |
| |
| The Spread toolkit is licensed under the Spread Open-Source License. |
| You may only use this software in compliance with the License. |
| A copy of the license can be found at http://www.spread.org/license |
| |
| This product uses software developed by Spread Concepts LLC for use |
| in the Spread toolkit. For more information about Spread, |
| see http://www.spread.org |
| |
| This software is distributed on an "AS IS" basis, WITHOUT WARRANTY OF |
| ANY KIND, either express or implied. |
| |
| Creators: |
| Yair Amir firstname.lastname@example.org |
| Michal Miskin-Amir email@example.com |
| Jonathan Stanton firstname.lastname@example.org |
| John Schultz email@example.com |
| |
| Major Contributors: |
| Ryan Caudy firstname.lastname@example.org - contribution to process groups.|
| Claudiu Danilov email@example.com - scalable, wide-area support. |
| Cristina Nita-Rotaru firstname.lastname@example.org - GC security. |
| Theo Schlossnagle email@example.com - Perl, autoconf, old skiplist. |
| Dan Schoenblum firstname.lastname@example.org - Java interface. |
| |
| Special thanks to the following for discussions and ideas: |
| Ken Birman, Danny Dolev, Jacob Green, Mike Goodrich, Ben Laurie, |
| David Shaw, Gene Tsudik, Robbert VanRenesse. |
| |
| Partial funding provided by the Defense Advanced Research Project Agency |
| (DARPA) and the National Security Agency (NSA) 2000-2004. The Spread |
| toolkit is not necessarily endorsed by DARPA or the NSA. |
| |
| For a full list of contributors, see Readme.txt in the distribution. |
| |
| WWW: www.spread.org www.spreadconcepts.com |
| Contact: email@example.com |
| |
| Version 4.01.00 Built 18/June/2009 |
\===========================================================================/
Conf_load_conf_file: using file: /vol/cit/etc/spread.conf
Successfully configured Segment 0 [127.0.0.255:4803] with 1 procs:
    localhost: 127.0.1.1
Finished configuration file.
Hash value for this configuration is: 2748807275
Conf_load_conf_file: My name: localhost, id: 127.0.1.1, port: 4803
- After this (without looking any further at the output), I started the RSB executables, which exhibited the behavior described in the issue. This also means that each first connection to Spread worked fine.
However, what I missed is that the Spread daemon output lacks the confirmation of the Spread daemons it found. Narf. After changing the installed Spread configuration ($prefix/etc/spread.conf) to use 127.0.0.1 as the IP for localhost, which is correct on Lion, the Spread daemon prints the missing lines to the console:
Successfully configured Segment 0 [127.0.0.255:4803] with 1 procs:
    localhost: 127.0.0.1
Finished configuration file.
Hash value for this configuration is: 1889207152
Conf_load_conf_file: My name: localhost, id: 127.0.0.1, port: 4803
Membership id is ( 2130706433, 1320161499)
--------------------
Configuration at localhost is:
Num Segments 1
    1 127.0.0.255 4803
        localhost 127.0.0.1
====================
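As a side note for reading this output: the first number in the "Membership id" line matches the configured daemon address written as a 32-bit integer. That is an observation about the numbers shown here, not documented Spread behavior. A minimal Python sketch of the conversion:

```python
import ipaddress

# First number from the "Membership id is ( 2130706433, ... )" line above.
membership_first = 2130706433

# Interpreting it as an IPv4 address in network byte order gives the
# address the daemon was configured with.
print(ipaddress.IPv4Address(membership_first))  # 127.0.0.1

# For comparison, the address from the broken configuration as an integer.
print(int(ipaddress.IPv4Address("127.0.1.1")))  # 2130706689
```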
After this change, everything works fine again. So, basically, the puzzling parts were the "Successfully configured" line and the fact that client applications could connect successfully to the Spread daemon a single time.
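Since the daemon reports "Successfully configured" even for an address that is wrong for the machine, it may help to check the address from spread.conf before starting the daemon. The following Python sketch tries to bind a UDP socket to the address. This is not how Spread itself validates anything; the helper name address_is_local is made up for illustration, and the idea that an unbindable address is a useful warning sign is an assumption. On Linux the whole 127.0.0.0/8 range is usually bindable, whereas on Mac OS typically only 127.0.0.1 is configured on the loopback interface, so results differ per platform.

```python
import socket


def address_is_local(ip: str) -> bool:
    """Hypothetical helper: True if a UDP socket can be bound to `ip`.

    Binding to port 0 lets the OS pick a free port, so the check does
    not collide with a running Spread daemon on port 4803.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((ip, 0))
        return True
    except OSError:
        # Typically "address not available" when no interface owns `ip`.
        return False
    finally:
        sock.close()


# The two candidate addresses for localhost discussed in this issue.
for candidate in ("127.0.0.1", "127.0.1.1"):
    print(candidate, "bindable" if address_is_local(candidate) else "NOT bindable")
```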
We should probably add a note to a Spread-specific transport page saying that one should look closely at the output of the Spread daemon and, in particular, check for the "Configuration at..." lines.
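Such a check could also be automated. The sketch below is only an illustration: it scans captured daemon output for a "Configuration at <name> is:" line and treats its absence as a warning sign. The function name confirmed_configuration is hypothetical, the two sample strings are excerpts from the outputs quoted in this issue, and how long the daemon takes to print the line is not covered here.

```python
import re


def confirmed_configuration(daemon_output: str) -> bool:
    """Hypothetical check: does the output contain 'Configuration at <name> is:'?"""
    pattern = r"^\s*Configuration at \S+ is:"
    return re.search(pattern, daemon_output, re.MULTILINE) is not None


# Excerpt of the output with the wrong address (127.0.1.1 on Lion).
BAD_OUTPUT = """\
Successfully configured Segment 0 [127.0.0.255:4803] with 1 procs:
    localhost: 127.0.1.1
Finished configuration file.
Conf_load_conf_file: My name: localhost, id: 127.0.1.1, port: 4803
"""

# Excerpt of the output after switching to 127.0.0.1.
GOOD_OUTPUT = """\
Successfully configured Segment 0 [127.0.0.255:4803] with 1 procs:
    localhost: 127.0.0.1
Finished configuration file.
Conf_load_conf_file: My name: localhost, id: 127.0.0.1, port: 4803
Membership id is ( 2130706433, 1320161499)
--------------------
Configuration at localhost is:
Num Segments 1
====================
"""

print(confirmed_configuration(BAD_OUTPUT))   # False
print(confirmed_configuration(GOOD_OUTPUT))  # True
```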