Bug #309

Create pure spread test case for latencies

Added by J. Wienke about 11 years ago. Updated over 10 years ago.

Status:         Resolved
Priority:       Normal
Assignee:       J. Wienke
Category:       -
Target version: 0.6
Start date:     05/24/2011
Due date:       -
% Done:         100%

Description

Some receive calls seem to take up to 100 ms under certain conditions on the informing side (possibly when it disconnects before all messages have been received?):

receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 1 ms
receive took 0 ms
receive took 99 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
[C++    Listener] Scope[/size50000/sub1/sub2/]: Event 1171/1200 received: 0x1fdbb00
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 99 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 99 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms
receive took 0 ms

History

#1 Updated by J. Wienke about 11 years ago

  • Status changed from New to In Progress
  • Assignee set to J. Wienke

#2 Updated by J. Wienke about 11 years ago

  • % Done changed from 0 to 50

A mail is sent to the spread list with this program:

//============================================================================
// Name        : spreadtest.cpp
// Author      : Johannes Wienke
// Version     :
// Copyright   : 2011, CoR-Lab, Bielefeld University
// Description : Pure spread test case for receive latencies
//============================================================================

#include <algorithm>
#include <iostream>
#include <string>

#include <assert.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

#include <boost/cstdint.hpp>
#include <boost/thread.hpp>
#include <boost/bind.hpp>

#include <sp.h>

using namespace std;

char randAlnumChar() {
    char c;
    do {
        c = (char) (rand() % ('z' - '0' + 1) + '0');
    } while (!std::isalnum(c));
    return c;
}

std::string randAlnumStr(const std::string::size_type &length) {
    std::string s;
    s.reserve(length);
    generate_n(std::back_inserter(s), length, randAlnumChar);
    return s;
}

void currentTimeMicros(boost::uint64_t &result) {

    timeval tv;
    gettimeofday(&tv, NULL);

    // (second-value in microseconds) + (microsecond value)
    result = (((boost::uint64_t) tv.tv_sec) * 1000000ull) + (tv.tv_usec);

}

const unsigned int numMessages = 1000;
const unsigned int dataLength = 100000;

void sender() {

    char privateGroup[MAX_PRIVATE_NAME + 1000];
    mailbox sendMbox;
    int connected = SP_connect("4803", 0, 0, false, &sendMbox, privateGroup);
    assert(connected == ACCEPT_SESSION);

    string groupName = "foo";

    // first send some messages
    string data = randAlnumStr(dataLength);
    for (unsigned int i = 0; i < numMessages; ++i) {
        int sent = SP_multicast(sendMbox, RELIABLE_MESS, groupName.c_str(), 0,
                strlen(data.c_str()) + 1, data.c_str());
        assert(sent >= 0);
    }

    int disconnected = SP_disconnect(sendMbox);
    assert(disconnected == 0);
    cout << "##### disconnected!" << endl;

}

int main() {

    boost::thread senderThread(&sender);

    // NB: this buffer is too small; SP_connect needs MAX_GROUP_NAME
    // characters here (this triggers the stack corruption discussed in #7)
    char privateGroup[MAX_PRIVATE_NAME];
    mailbox receiveMbox;
    int connected = SP_connect("4803", 0, 0, false, &receiveMbox, privateGroup);
    assert(connected == ACCEPT_SESSION);

    string groupName = "foo";
    SP_join(receiveMbox, groupName.c_str());

    // then receive them
    service service;
    char sender[MAX_GROUP_NAME];
    const int maxGroups = 10;
    int numGroups = 0;
    char groups[maxGroups][MAX_GROUP_NAME];
    short int messageType;
    int endian;
    char message[dataLength + 10];
    boost::uint64_t before;
    boost::uint64_t after;
    for (unsigned int i = 0; i < numMessages; ++i) {
        currentTimeMicros(before);
        int size = SP_receive(receiveMbox, &service, sender, maxGroups,
                &numGroups, groups, &messageType, &endian, dataLength + 10,
                message);
        currentTimeMicros(after);
        cout << "receive took " << (after - before) << " us" << endl;
        assert(size >= 0);
        boost::this_thread::sleep(boost::posix_time::milliseconds(1));
    }

    cout << "received messages" << endl;

    // leave group, disconnect mailboxes
    int left = SP_leave(receiveMbox, groupName.c_str());
    assert(left == 0);
    int disconnected = SP_disconnect(receiveMbox);
    assert(disconnected == 0);

    senderThread.join();

    return 0;

}

#3 Updated by J. Wienke about 11 years ago

The Spread website and mailing list have been completely down for several days, so the message could not be delivered yet.

#4 Updated by J. Wienke about 11 years ago

  • Target version deleted (0.3)

As the spread website and mailing list are still completely inaccessible, I will remove this ticket from version 0.3. It will be resolved once there is contact with the spread community again.

#5 Updated by J. Wienke about 11 years ago

  • % Done changed from 50 to 70

Finally sent to the list...

#6 Updated by J. Wienke almost 11 years ago

  • Status changed from In Progress to Feedback
  • Assignee deleted (J. Wienke)

There has been no reply from the list for weeks. I will more or less close this now, as I do not feel that I have time to investigate this. This is the list thread: http://lists.spread.org/pipermail/spread-users/2011-August/thread.html#4462

#7 Updated by J. Wienke over 10 years ago

Reply from the list:

Johannes + everybody,

I've looked into this report (thanks for the demonstration app!) and have figured out both complaints.

The first complaint is that a receiver can sometimes see latencies of up to 100ms from SP_receive even when lots of traffic has been sent to them.

This happens because of the way Spread currently deals with writes to client sockets that block. When a write to a client socket blocks, Spread schedules a Badger_timeout (by default 100 ms) callback to try writing to the socket again in the future. If, in the meantime, another message is queued on the connection, then it proactively tries to send immediately. This means that if the daemon has a bit of a backlog of messages to send to a receiver, its TCP socket blocks, and no more traffic is subsequently queued on the connection, then you will see up to a 100 ms pause until the daemon tries to send to it again. It is possible that it will fill up the TCP buffers and block again (quickly), especially with very large messages like in the example, and schedule another timeout for another 100 ms. So, you can trigger the very stutter-stop kind of performance that you reported.

One simple way to mitigate this issue is to reduce Badger_timeout (top of session.c) significantly (e.g., by a factor of 10-100). The drawback to this is the overhead of "polling" truly blocked sockets every Badger_timeout.

The "better way" to mitigate this issue is to instead have Spread monitor write-ability of client sockets and push out traffic as the sockets are able. We need to think if doing this would have any adverse effects versus the way Spread currently handles client I/O.

The second complaint was that you got a stack corruption bug when you passed a privateGroup array of only MAX_PRIVATE_NAME characters. This is expected, as the privateGroup array must be MAX_GROUP_NAME characters long.

Sorry it took so long to get around to diagnosing these issues.

Cheers!

#8 Updated by J. Wienke over 10 years ago

  • Status changed from Feedback to Resolved
  • Assignee set to J. Wienke
  • Target version set to 0.6
  • % Done changed from 70 to 100

Another reply:

Johannes,

In the new version of Spread we are currently developing we believe we've solved this issue. Spread will now use both the old badger timeout semantics and also monitor client sockets for write-ability. So, the issue you saw should no longer exist in the next version of Spread which is due out soon.

Cheers!

I am marking this as resolved, as there is now an upstream fix.
