Author Topic: UDT congestion control algorithm implemented  (Read 39728 times)

AshMcConnell

  • Full Member
  • ***
  • Posts: 169
  • Karma: 3
    • View Profile
    • Sirocco Racing Simulator
Re: UDT congestion control algorithm implemented
« Reply #30 on: August 15, 2009, 03:26:31 AM »
Quote from: Rak'kar
It might be that, because the new version is faster (CPU-wise), it sends out your data right away rather than grouping it together in one datagram. Try setting the sleep timer to a higher value (10, 15, even 30) and see how that affects your bandwidth.

Hi Rak'kar,

I did some testing at the various values; 10 seems to be the best value.  I have tweaked the LOD levels again, as the quality wasn't quite good enough.  With the new UDT version I got about ~280 Kbits/sec and with the old method I got ~240 K/sec.  I must be doing something silly; I can't figure out what though :)

All the best,
Ash

OvermindDL1

  • Anti-Spam Moderator
  • Hero Member
  • ****
  • Posts: 855
  • Karma: 40
  • Programmer
    • View Profile
    • OvermindDL1's Site
Re: UDT congestion control algorithm implemented
« Reply #31 on: October 16, 2010, 12:14:39 AM »
Now this is fascinating: RakNet is using UDT code in its back-end now?  I dropped out of the forums after I had not been using RakNet in any projects for a few months, and I ended up using UDT later on (I am currently integrating it with Boost.ASIO so I have the reactor pattern available, and I intend to submit patches once I confirm it works).  It looks like it gave you a boon as well, which is surprising; I always found RakNet to be very fast.

Right now I am debating building this new big project of mine on RakNet instead of UDT (through ASIO); it fits slightly better with some RakNet-style features, but I have really grown attached to the reactor-style interface that Boost.ASIO uses...

Rak'kar

  • Administrator
  • Hero Member
  • *****
  • Posts: 6895
  • Karma: 291
    • View Profile
    • RakNet
Re: UDT congestion control algorithm implemented
« Reply #32 on: October 16, 2010, 09:53:39 AM »
Actually it never used straight UDT, and in SVN right now it's using the sliding-window algorithm, which works better in China because it is not ping-dependent.

Can you describe what you like about the boost interfaces? I might be able to implement something similar. Also what led you to use that instead of RakNet?

OvermindDL1

  • Anti-Spam Moderator
  • Hero Member
  • ****
  • Posts: 855
  • Karma: 40
  • Programmer
    • View Profile
    • OvermindDL1's Site
Re: UDT congestion control algorithm implemented
« Reply #33 on: October 17, 2010, 03:38:31 AM »
Yeah, I noticed by looking through your defunct blog (courtesy of Google Cache, back from April) that you tried to integrate it, but it did not fit your design well, so you took its window algorithm instead, and that provided a noticeable boon.  I am curious: what is special about China's network that makes a ping-based approach less effective, so that you had to go back to a sliding window?  Surely their pings are not 'that' bad?

The reactor interface just lets me do a whole lot more than just networking: it lets me register callbacks and timers (of all sorts), and lets me implement a coroutine pattern on top of it (which *VASTLY* simplifies so much code; not a 'true' coroutine of course, this is C++ after all, but it emulates one well enough that the scaffolding code of asynchronous callbacks all but vanishes).

Basically, boost::asio::io_service is a kind of asynchronous 'pump'.  I create one of these in my app (there are a few occasions where creating multiple may be better, but not the way I am writing things).  If I need synchronization then I use boost::asio::strand to 'serialize' access to specific callbacks.
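
For anyone who has not used it, the pump-plus-strand combination in isolation looks roughly like this (a minimal sketch, not from my project; 'bump' and the counter are made-up names):
Code: [Select]
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>
#include <iostream>

int counter = 0; // shared state, touched only through the strand

void bump()
{
    // Safe without a mutex: the strand guarantees these handlers never run concurrently.
    ++counter;
    std::cout << "counter = " << counter << std::endl;
}

int main()
{
    boost::asio::io_service io_service;
    boost::asio::io_service::strand strand(io_service);

    // Post work from anywhere; the strand serializes the handlers even with several threads.
    for (int i = 0; i < 10; ++i)
        strand.post(&bump);

    boost::thread_group threads;
    for (std::size_t i = 0; i < 4; ++i)
        threads.create_thread(boost::bind(&boost::asio::io_service::run, &io_service));
    threads.join_all();
}

If the handlers touch different data you skip the strand entirely and let them run on whatever thread is free.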

For my networking I register a tcp/udp/udt acceptor with the io_service, which starts a platform-independent method of asynchronous, safe, and efficient polling (although I think RakNet may well outperform it due to your highly optimized back-end).

Hmm, let me demonstrate this with some code.  This is copy-pasted from one of my last projects (a quickie designed just to 'work', not to be pretty, but work it does, very well).

First, the scaffolding:

Create the server.  To make debugging easier I force a thread count (usually 1, as I have here, but sometimes 2-4 to test concurrency; oddly, I have had no concurrency-related issues since I started this style of programming):
Code: [Select]
#ifdef _DEBUG
// Force a single io_service thread in debug builds; much easier to step through handlers.
Server server(_threadCount=1);
#else
// Default thread count (boost::thread::hardware_concurrency(), see Server::run below).
Server server;
#endif

Running it is as simple as this:
Code: [Select]
    server.run();

This all happens in main; basically all my main function does is register a console control handler in nasty ifdef blocks for the various OSes, instantiate the server, run it, and then exit.  The console control handler hooks things like close, Ctrl+C and so forth, and just calls server.stop().  Do note, server.stop() is implemented as just this:
Code: [Select]
    io_service_.stop();
And nothing else.  Do note, this does not just 'stop' all the async callbacks going through boost asio; it sends a 'kill' error code to all the waiting handlers.  Every boost::asio handler takes a variety of parameters, depending on what it is waiting on, except for one that is always passed in (and in my style is always the first param):
Code: [Select]
boost::system::error_code
I always assign it to ec, nice and simple, and pretty much in every callback I have I test it before doing anything else:
Code: [Select]
if (ec)
{
    // do closing stuff to clean up handler
    return; // without processing
}

The ec may also carry handler-specific error codes, but in the generic case this always suffices, and I rarely have any cleanup code; I usually just return, which also auto-destructs the handler with how I have things built (it does not normally do that).

What io_service does is just handle the concurrency: it takes data given to it from various sources and calls a handler when one is next available.  For example, when the TCP acceptor receives an incoming connection and a handler is waiting for a connection, it calls that handler's asynchronous callback, the io_service having already tested the TCP acceptor for incoming data (it is more efficient than it sounds; take a look at the back-end).
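
To make that concrete before I get to my coroutine-flavoured listener below, a bare-bones accept loop written with plain callbacks looks roughly like this (sketch only, not my server code; 'OnAccept' and the port are made up):
Code: [Select]
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/shared_ptr.hpp>
#include <iostream>

typedef boost::shared_ptr<boost::asio::ip::tcp::socket> socket_ptr;

void OnAccept(boost::asio::ip::tcp::acceptor* acceptor, socket_ptr socket,
              const boost::system::error_code& ec)
{
    if (ec)
        return; // acceptor closed or io_service stopped

    std::cout << "Incoming connection: "
              << socket->remote_endpoint().address().to_string() << std::endl;
    // ... normally you would start an async_read_some on 'socket' here ...

    // Queue up another accept so the listener keeps running.
    socket_ptr next(new boost::asio::ip::tcp::socket(acceptor->get_io_service()));
    acceptor->async_accept(*next, boost::bind(&OnAccept, acceptor, next,
                                              boost::asio::placeholders::error));
}

int main()
{
    boost::asio::io_service io_service;
    boost::asio::ip::tcp::acceptor acceptor(io_service,
        boost::asio::ip::tcp::endpoint(boost::asio::ip::tcp::v4(), 4000)); // arbitrary port

    socket_ptr first(new boost::asio::ip::tcp::socket(io_service));
    acceptor.async_accept(*first, boost::bind(&OnAccept, &acceptor, first,
                                              boost::asio::placeholders::error));
    io_service.run();
}

The coroutine version later in this post is doing exactly the same accept-and-rearm dance, just with the callback plumbing hidden behind the macros.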

What my run method on the server does is just this:
Code: [Select]
void Server::run( void )
{
    std::cout << "Running server" << std::endl; // as stated, this was a quick-together server, not pretty, but it works
    try
    {
        // *SNIP* register various server services here

        boost::thread_group threads;
        for(size_t i = 0; i < threadCount_; ++i) // threadCount_ defaults to boost::thread::hardware_concurrency() if not passed to the constructor
            threads.create_thread(boost::bind(&boost::asio::io_service::run, &io_service_));
        threads.join_all();

        std::cout << "Gracefully exiting..." << std::endl;
    }
    catch(std::exception& e)
    {
        std::cerr << "Exception-exiting server: " << e.what() << std::endl;
    }
    std::cout << "Exited" << std::endl;
}

So that is pretty much the poller for everything I register with it at the *SNIP*'d part.  Such things range from a handler that needs to run every 500ms (and the interval does not vary: if the handler runs a few milliseconds later than it should, the timer offsets the next deadline to keep it on the 500ms mark, and you get the variance internally so you can adjust if necessary, which I do not need to for this server; I sketch this kind of drift-free timer further down in this post), to handlers that respond to input data from some input source, to a handler that watches for a file on the hard drive changing (you can Google this plugin for asio; it is easy to make plugins for asio, and this one just uses system-dependent methods to watch whether a file changes on the filesystem, or falls back to a polling timer, and calls a callback to handle the change), to setting up the network acceptor, which in this case looks like this.  (As stated, I use a coroutine pattern, so it may look a little foreign to normal C++ programmers, but it let me cut out a *lot* of programming time on such a simple server; this is real C++, just with a bit of macro usage to hide the boilerplate.)
Code: [Select]
void NetworkServer::operator()( boost::system::error_code ec /*= boost::system::error_code()*/, std::size_t length /*= 0*/ )
{
    if (!ec)
    {
        reenter (this)
        {
            // Loop to accept incoming connections.
            std::cout << "Running network listener..." << std::endl;
            do
            {
                // Create a new socket for the next incoming connection.
                this_->socket_.reset(new boost::asio::ip::tcp::socket(acceptor_->get_io_service()));

                // Accept a new connection.
                yield this_->acceptor_->async_accept(*this_->socket_, *this);

                std::cout << " Incoming connection:  " << this_->socket_->remote_endpoint().address().to_string() << std::endl;

                fork NetworkServer(*this)();
            } while(is_parent());

            // Then this is an incoming client connection

            buffer_.reset(new std::vector<unsigned char>);

            buffer_->reserve(64);

            ParseTag();
            if(this_->tag_!=0x02) // handshake
            {
                std::cout << " Client did not send a handshake as the first packet, instead it sent 0x" << std::hex << (size_t)this_->tag_ << " from address: " << this_->socket_->remote_endpoint().address().to_string() << "  Disconnecting!" << std::endl;
                goto ExitDueToError; // Yes, I know, ugly and all that, but this is a quick server and no cleanup is necessary
            }
            ParseStr((this_->username_));

            std::cout << " Handshake: " << this_->socket_->remote_endpoint().address().to_string() << " - " << (this_->username_) << std::endl;

            this_->buffer_.resize(1+2+serverName.length());
            (this_->buffer_)[0] = 2;
            (*reinterpret_cast<short*>(&((this_->buffer_)[1]))) = SwapBytes<sizeof(short)>(static_cast<short>(serverName.length()));
            for(size_t i=0;i<serverName.length();++i)
                (this_->buffer_)[i+2+1] = serverName[i];
            WriteBufferOut();

            // Make sure we receive the client's login packet before we do anything else!
            ParseTag();
            if(this_->tag_!=0x01) // Login
            {
                std::cout << " Client did not send a login as the second packet, instead it sent 0x" << std::hex << this_->tag_ << " from address: " << this_->socket_->remote_endpoint().address().to_string() << "  Disconnecting!" << std::endl;
                goto ExitDueToError;
            }
            ParseBytes(4);
            {
                int x = ConvertTypeFromBufferAtIdx(int, 0);
                if(x!=2)
                {
                    std::cout << " Client is using unsupported version (" << x << ") in login: " << this_->socket_->remote_endpoint().address().to_string() << std::endl;
                    goto ExitDueToError;
                }
            }
            ParseStr((this_->username_)); // username
            std::cout << " Login request: " << this_->socket_->remote_endpoint().address().to_string() << " - " << (this_->username_) << std::endl;
            SkipStr(); // password

            // Send the server login packet...
            this_->buffer_.resize(1+4+2+2);
            (this_->buffer_)[0]=1;
            (this_->buffer_)[1]=0;
            (this_->buffer_)[2]=0;
            (this_->buffer_)[3]=0;
            (this_->buffer_)[4]=0;
            (this_->buffer_)[5]=0;
            (this_->buffer_)[6]=0;
            (this_->buffer_)[7]=0;
            (this_->buffer_)[8]=0;
            std::cout << " Sending server login packet to client..." << std::flush;
            WriteBufferOut();
            std::cout << "complete" << std::endl;

            /// Primary dispatcher loop
            for(;;)
            {
                ParseTag();
                if(this_->tag_==0xFF) // Disconnect
                {
                    std::cout << " Player \"" << (this_->username_) << "\": Requested Disconnect, reason: " << std::flush;
                    ParseStr((this_->username_));
                    std::cout << (this_->username_) << std::endl;
                    goto ExitDueToExiting;
                }
                if(!packetCB_[this_->tag_])
                {
                    std::cout << " Client sent an unhandled packet: 0x" << std::hex << (size_t)this_->tag_ << " - from: \"" << (this_->username_) << "\" - dropping connection:  " << this_->socket_->remote_endpoint().address().to_string() << std::endl;
                    goto ExitDueToError;
                }
                this_->subCoroutine_ = coroutine();
                while(!(packetCB_[this_->tag_](this)))
                {
                    yield;
                }
            }

            // *SNIP* ugly cleanup goto labels and stuff

            std::cout << " Closing socket: " << this_->socket_->remote_endpoint().address().to_string() << " - " << std::flush;
            this_->socket_->shutdown(boost::asio::ip::tcp::socket::shutdown_both, ec);
            std::cout << "complete" << std::endl;
        }
    }
    else
    {
        std::cout << " Error from socket " << this_->socket_->remote_endpoint().address().to_string() << " - code: " << ec << " - " << ec.message() << std::endl;
    }
}

Yes, difficult though this may be to believe, that is a fully asynchronous server running in as many threads as the computer has cores, and it can easily peg the CPU at 100% with full utilization (which, admittedly, I have not come close to reaching; it barely registers as a 0.1% blip of CPU usage).  There is no sleep()'ing going on, no busy-wait polling loop, etc...  And do note, packetCB_ is a plain array of 255 callbacks, and the optimizer does a pretty good job of optimizing the function calls away.
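
For reference, the dispatch table is nothing fancy; roughly this shape (a guess at the declaration, simplified; the exact types in my code differ slightly):
Code: [Select]
// One slot per one-byte tag; a null entry means "unhandled packet, drop the client".
// 0xFF (disconnect) is checked explicitly before the table is consulted.
typedef bool (*PacketHandler)(NetworkServer* self); // returns true once the packet is fully parsed
PacketHandler packetCB_[255];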

My past, admittedly much more complex, servers based on this design have scaled exceedingly well, and the pattern works very well for both HTTP servers and game clients.  I just had to learn to split up operations on various parts of the system into smaller tasks with few interdependencies to maximize concurrency (and in some places I 'duplicate' memory so I can use a data-oriented programming pattern that lets me work over 'data' like lookups, physics, sound, etc... in large chunks without calling functions repeatedly; I get some *very* nice speed boosts over traditional OO hierarchies this way).

So basically, what I like about this interface is that it has allowed me to handle not only networking, but input, file management (I can even start loading a file in the background and register 'yet another' asynchronous callback to be called when it completes), timers, everything in one highly optimized, asynchronous interface that lets me build vastly more powerful constructs such as coroutines on top (the creator of ASIO had this pattern in mind when he created it, but does not force it on anyone anywhere; there are just a few entries on his blog where he shows off the wonders of this design).
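
As a small illustration of the timer side (a sketch only, not code from that server): the fixed-interval 500ms handler I mentioned above stays on the mark by advancing the previous deadline rather than rescheduling from 'now':
Code: [Select]
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>

void tick(boost::asio::deadline_timer* timer, const boost::system::error_code& ec)
{
    if (ec)
        return; // io_service.stop() or timer->cancel() lands here

    std::cout << "500ms tick" << std::endl;

    // Advance from the previous deadline, not from "now", so a handler that runs
    // a few milliseconds late does not accumulate drift.
    timer->expires_at(timer->expires_at() + boost::posix_time::milliseconds(500));
    timer->async_wait(boost::bind(&tick, timer, boost::asio::placeholders::error));
}

int main()
{
    boost::asio::io_service io_service;
    boost::asio::deadline_timer timer(io_service, boost::posix_time::milliseconds(500));
    timer.async_wait(boost::bind(&tick, &timer, boost::asio::placeholders::error));
    io_service.run();
}

Cancel the timer (or stop the io_service) and the handler runs once more with an error code, which is the early-return case at the top.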

It would be 'possible' to integrate RakNet into this back-end non-intrusively, but that would probably involve registering a timer to go off every, say, 5/10/20/whatever milliseconds to poll the RakNet thread and then post the results to the io_service callback handlers.  The proper way to do it would be to have RakNet post its messages to the io_service and callback handlers directly, without polling; RakNet could even expose different stream and non-stream interfaces for its various ordered modes and so forth.  Hmm, I wonder what it would look like if RakNet's interface were built on top of ASIO's UDP socket from the get-go...
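
A rough sketch of that polling variant (assumptions: RakNet 3.x-style headers and class names, 'HandlePacket' is a made-up name, and the peer and timer are created and started elsewhere, with the timer armed once in the same way as the 500ms example above):
Code: [Select]
#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/date_time/posix_time/posix_time.hpp>
#include "RakPeerInterface.h" // adjust the include/namespace for your RakNet version

void HandlePacket(Packet* packet, RakPeerInterface* peer)
{
    // ... inspect packet->data / packet->length here ...
    peer->DeallocatePacket(packet);
}

void PollRakNet(RakPeerInterface* peer, boost::asio::deadline_timer* timer,
                boost::asio::io_service* io_service, const boost::system::error_code& ec)
{
    if (ec)
        return; // timer cancelled or io_service stopped

    // Drain everything RakNet has buffered and hand each packet to the pump.
    for (Packet* packet = peer->Receive(); packet != 0; packet = peer->Receive())
        io_service->post(boost::bind(&HandlePacket, packet, peer));

    // Re-arm the poll, here every 10 milliseconds.
    timer->expires_at(timer->expires_at() + boost::posix_time::milliseconds(10));
    timer->async_wait(boost::bind(&PollRakNet, peer, timer, io_service,
                                  boost::asio::placeholders::error));
}

The cleaner route is still to have RakNet call io_service::post() itself the moment a message is ready, which removes the polling interval entirely.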

Well, as an example: when I accept a new connection with "this_->acceptor_->async_accept(*this_->socket_, *this);" as above, the two parameters are a socket (specialized on the tcp type here, but it could be udp or udt as well, or any other that is written) and the callback (which in this case, thanks to the yield, will continue right after the yield'd line).  When it is called again (when execution continues on the next line) the socket has been filled in (or, if something happened and the network died, the ec error code would carry information about that, but in this function I do not care; if there is an error I just shut down the connection and write a log entry and so forth).  It then continues on, forks the coroutine to carry on past the accept loop, and eventually hits ParseTag();, which is just a macro defined as:
Code: [Select]
#define ParseTag() \
this_->buffer_.resize(1); \
yield this_->socket_->async_read_some(boost::asio::buffer(&((this_->buffer_)[0]),1), *this); \
this_->tag_=(this_->buffer_)[0];
I tell it that I just want 1 byte off the TCP stream, it yields, and when there is at least one byte on the stream it continues to the next line, where I pull the byte off.  If this were UDP you would probably just use async_receive_from(...) to get the whole datagram instead of bits and pieces.  This is in no way the most efficient approach; I would normally tell it to read up to some large buffer size and parse that out piecemeal in-function, but as I was whipping up this server I just wanted it done, and it did not need to be efficient.

But the way async_read_some works is that it takes what it needs there and issues a post() call to the io_service with information about what needs to be done.  When that ripples down the async queue (it might be immediately next if there are a lot of timers sleeping or the like) it gets called and checks whether the data is there; if not, it posts again and keeps posting until data is available, at which point it calls the handler to handle it.  The data check is not checking the TCP buffer itself, but rather the post buffer.  On Windows, for example, it uses I/O completion ports in the background, and when there is incoming data it posts that data to the io_service to be handled when a thread is available.  It is optimized a lot better than how I am describing it, but this gets the idea across more simply.  On Windows, the iocp call is this:
Code: [Select]
  template <typename MutableBufferSequence, typename Handler>
  void async_receive_from(implementation_type& impl,
      const MutableBufferSequence& buffers, endpoint_type& sender_endp,
      socket_base::message_flags flags, Handler handler)
  {
    // Allocate and construct an operation to wrap the handler.
    typedef win_iocp_socket_recvfrom_op<
      MutableBufferSequence, endpoint_type, Handler> op;
    typename op::ptr p = { boost::addressof(handler),
      boost_asio_handler_alloc_helpers::allocate(
        sizeof(op), handler), 0 };
    p.p = new (p.v) op(sender_endp, impl.cancel_token_, buffers, handler);

    buffer_sequence_adapter<boost::asio::mutable_buffer,
        MutableBufferSequence> bufs(buffers);

    start_receive_from_op(impl, bufs.buffers(), bufs.count(),
        sender_endp.data(), flags, &p.p->endpoint_size(), p.p);
    p.v = p.p = 0;
  }

And basically start_receive_from_op readies a few other things, but its primary purpose is to call "iocp_service_.on_completion(op, boost::asio::error::bad_descriptor);", which calls PostQueuedCompletionStatus (it also tests whether the request is valid, whether the port is valid, etc., and *it* is what posts the error_code to the handler if any of that fails, among a few other things; the optimizer optimizes out nearly all of these calls because the design is inlined and templated).

I am getting too long here, ask if you need something clarified.

Rak'kar

  • Administrator
  • Hero Member
  • *****
  • Posts: 6895
  • Karma: 291
    • View Profile
    • RakNet
Re: UDT congestion control algorithm implemented
« Reply #34 on: October 17, 2010, 10:29:21 AM »
Wow, that's all pretty complex. I'll probably have to just try coding some of that myself to really figure out what is going on.

The problem with ping-based congestion control is that I need magic numbers at some point, and if I set them too high or too low for a given network it doesn't run well. For example, China may have pings with a high deviation, so the magic numbers need to tolerate spikes. But those same numbers in the US, where pings are more consistent, may result in high packetloss. Also, running over a LAN takes different numbers than running over the internet. The final numbers I used were pretty good, and I had a lot of code that analyzed the ping to pick numbers automatically, but even with that I still get some complaints from China where it doesn't work.
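
To give an idea of what I mean by analyzing the ping (an illustration only, not the actual RakNet code): you track a smoothed mean and deviation and derive the 'spike' threshold from those, so the number adapts to the network instead of being fixed:
Code: [Select]
// Illustration only, not RakNet's code: derive the "this ping is a spike" threshold
// from a running mean and deviation instead of hard-coding a constant.
class PingTracker
{
public:
    PingTracker() : meanMs_(100.0), deviationMs_(20.0) {}

    void OnPingSample(double pingMs)
    {
        double error = pingMs > meanMs_ ? pingMs - meanMs_ : meanMs_ - pingMs;
        deviationMs_ = 0.75 * deviationMs_ + 0.25 * error; // exponentially weighted, TCP-RTO style
        meanMs_ = 0.875 * meanMs_ + 0.125 * pingMs;
    }

    // Anything above this is treated as a spike rather than real congestion.
    double SpikeThresholdMs() const { return meanMs_ + 4.0 * deviationMs_; }

private:
    double meanMs_;
    double deviationMs_;
};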

Also, depending on how some games use threads, this could result in ping spikes. I had one customer where every so often the network thread would not run for about 30 milliseconds, which was enough to be considered a spike, which caused the congestion control at the time to back off, resulting in very slow throughput.

Plus, it's very hard to debug as you can imagine.

When I switched to sliding window, under all the tests I normally run it was just as good and I feel MUCH more confident about it because the whole ping issue is avoided now.

OvermindDL1

  • Anti-Spam Moderator
  • Hero Member
  • ****
  • Posts: 855
  • Karma: 40
  • Programmer
    • View Profile
    • OvermindDL1's Site
Re: UDT congestion control algorithm implemented
« Reply #35 on: October 17, 2010, 01:47:40 PM »
Quote from: Rak'kar
Wow, that's all pretty complex. I'll probably have to just try coding some of that myself to really figure out what is going on.
It is quite easy once you start using it, but it does have a rather high initial learning curve.

Quote from: Rak'kar
The problem with ping-based congestion control is that I need magic numbers at some point, and if I set them too high or too low for a given network it doesn't run well. For example, China may have pings with a high deviation, so the magic numbers need to tolerate spikes. But those same numbers in the US, where pings are more consistent, may result in high packetloss. Also, running over a LAN takes different numbers than running over the internet. The final numbers I used were pretty good, and I had a lot of code that analyzed the ping to pick numbers automatically, but even with that I still get some complaints from China where it doesn't work.
Hmm, perhaps it would be good to template your engine?  You could have a template parameter that is a functor type to handle such things: one for LAN, one for high latency, one for whatever else...

You would just call the functor from your code, and if the templating is done properly and the functor is defined in a header that is included into it, the call should be inlined just fine.
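
Purely as an illustration (made-up names, not your code): something along these lines, with the policy visible in a header so the call inlines away:
Code: [Select]
// Made-up names, illustration only: the congestion behaviour is supplied as a
// template parameter, so picking LAN vs. high-latency behaviour costs nothing at runtime.
struct LanPolicy
{
    unsigned int RetransmitTimeoutMs(unsigned int smoothedPingMs) const
    {
        return smoothedPingMs * 2 + 10; // tight numbers, pings are stable on a LAN
    }
};

struct HighLatencyPolicy
{
    unsigned int RetransmitTimeoutMs(unsigned int smoothedPingMs) const
    {
        return smoothedPingMs * 3 + 250; // tolerate large ping spikes
    }
};

template <typename CongestionPolicy>
class ReliableChannel
{
public:
    unsigned int NextTimeout(unsigned int smoothedPingMs) const
    {
        // Defined in the header, so the compiler can inline the policy call.
        return policy_.RetransmitTimeoutMs(smoothedPingMs);
    }

private:
    CongestionPolicy policy_;
};

// Usage: ReliableChannel<LanPolicy> lan;  ReliableChannel<HighLatencyPolicy> wan;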

Quote from: Rak'kar
Also, depending on how some games use threads, this could result in ping spikes. I had one customer where every so often the network thread would not run for about 30 milliseconds, which was enough to be considered a spike, which caused the congestion control at the time to back off, resulting in very slow throughput.
Sounds like a client bug, not a bug in your library, they should fix their code.

Quote from: Rak'kar
Plus, it's very hard to debug as you can imagine.

When I switched to sliding window, under all the tests I normally run it was just as good and I feel MUCH more confident about it because the whole ping issue is avoided now.
Just as good?  Any performance metrics?  Won't a sliding window cap out on higher-bandwidth networks unless it is allowed to grow abnormally large?

Rak'kar

  • Administrator
  • Hero Member
  • *****
  • Posts: 6895
  • Karma: 291
    • View Profile
    • RakNet
Re: UDT congestion control algorithm implemented
« Reply #36 on: October 17, 2010, 06:16:04 PM »
I generally say it works if it transfers a file at the bandwidth capacity without more than 1 or 2 percent packetloss.

The sliding window doubles for each ack until you get packetloss, so it should be fine for the general case. However, RakNet will never be as fast as UDT for LAN transfers because it supports all those other settings for sending data, which take extra computation and bandwidth. If I wanted it to be as fast as UDT I'd have to take out all the extra features, in which case you might as well use UDT.
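
Roughly speaking, the growth rule is just this (an illustration of the idea, not the actual RakNet source; the doubling-per-ack is as described above, and the halving-on-loss shown here is my simplification):
Code: [Select]
// Illustration of the idea, not the actual RakNet source.
class SlidingWindow
{
public:
    SlidingWindow() : windowBytes_(1400) {} // start around one datagram

    void OnAck() { windowBytes_ *= 2; } // grow while the link keeps up

    void OnPacketloss()
    {
        windowBytes_ /= 2;                            // back off when the link pushes back
        if (windowBytes_ < 1400) windowBytes_ = 1400; // never collapse below one datagram
    }

    unsigned int BytesAllowedInFlight() const { return windowBytes_; }

private:
    unsigned int windowBytes_;
};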

OvermindDL1

  • Anti-Spam Moderator
  • Hero Member
  • ****
  • Posts: 855
  • Karma: 40
  • Programmer
    • View Profile
    • OvermindDL1's Site
Re: UDT congestion control algorithm implemented
« Reply #37 on: October 17, 2010, 07:04:33 PM »
Well, I was thinking of using RakNet for the server<->client communication across the 'net anyway; I am still using UDT between servers since that is what Sector/Sphere uses natively.

Rak'kar

  • Administrator
  • Hero Member
  • *****
  • Posts: 6895
  • Karma: 291
    • View Profile
    • RakNet
Re: UDT congestion control algorithm implemented
« Reply #38 on: November 05, 2010, 09:11:00 AM »
You are correct. If you already have data in the payload that lets you determine the order, then it is not necessary to ask RakNet to do it for you.