[Dnsmasq-discuss] dnsmasq 2.91test9 + TCP + stale cache

Sat Feb 1 10:12:01 UTC 2025

On 1/31/25 23:20, Simon Kelley wrote:
> 
> 
> On 1/31/25 16:19, Dominik Derigs wrote:
>> Hey Simon,
>>
>> we have found another (small) thing. The requirements for
>> reproducing it are:
>>
>> 1. Using --use-stale-cache
>> 2. A query is received via TCP
>> 3. The cached record is stale
>>
>> Querying the stale record causes the record to be "refreshed". However,
>> if the client disconnects at the same time (and the TCP fork exits
>> accordingly), the upstream reply is never received and never finds its
>> way into the mother process's cache.
>>
>> Could we postpone the shutdown of TCP forks in case a refreshment 
>> query is still ongoing?
>>
> 
> That's not what happens, or at least that's not what is supposed to happen.
> 
> The control flow in tcp_request() when stale data is found in the cache 
> goes as follows.
> 
> 1) Read the query from the TCP client connection.
> 2) Look it up in the cache and get a stale answer.
> 3) Return the stale answer to the requestor over the TCP connection.
> 4) Close the client TCP connection server-side. This forces the client to
> open another connection, and so create a new process, if it has more
> queries. The reason for this is that the existing process now
> 5) sends the query upstream and blocks awaiting the answer,
> 6) receives the answer and caches it in the local process, which also
> serialises the answer into the pipe to the parent process, and
> 7) returns from tcp_request(), at which point the process exits.
> 
> At this point the data has either been read from the pipe by the parent 
> process and inserted into its cache, or it is still in the pipe buffer 
> and will shortly be read. This is why the process-management code in 
> dnsmasq.c doesn't free a process slot until _both_ the process has gone 
> and the pipe has returned EOF and been closed.
> 
> Pipes don't disappear and lose data until both ends have been closed. If 
> the write end is closed but there is queued data, then the read end can 
> still read the queued data.
> 
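
The pipe behaviour described above can be checked in isolation. What follows is a minimal sketch, not dnsmasq code: struct child_slot, slot_is_free(), start_child(), reap_child() and drain_pipe() are hypothetical names invented for the illustration. The child writes a short message and exits at once; the parent reaps it first and only then reads the pipe, keeping the slot busy until both the process is gone and the pipe has returned EOF.

```c
/* Sketch only: none of these names come from dnsmasq. */
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical parent-side record of one TCP child. */
struct child_slot {
    pid_t pid;    /* 0 once the child has been reaped */
    int pipe_fd;  /* -1 once the pipe has hit EOF and been closed */
};

/* The slot is reusable only when BOTH conditions hold. */
static int slot_is_free(const struct child_slot *s)
{
    return s->pid == 0 && s->pipe_fd == -1;
}

/* Fork a child which writes msg into a pipe and exits immediately.
   msg must fit in the pipe buffer: nothing reads it until later. */
static int start_child(const char *msg, struct child_slot *s)
{
    int fds[2];
    pid_t pid;

    assert(msg != NULL && s != NULL);
    if (pipe(fds) == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        size_t len = strlen(msg);
        close(fds[0]);
        /* Exiting closes the write end; the queued bytes stay in the pipe. */
        _exit(write(fds[1], msg, len) == (ssize_t)len ? 0 : 1);
    }

    close(fds[1]); /* the parent keeps only the read end */
    s->pid = pid;
    s->pipe_fd = fds[0];
    return 0;
}

/* Wait for the child to exit. The pipe is left untouched. */
static int reap_child(struct child_slot *s)
{
    int status;

    while (waitpid(s->pid, &status, 0) == -1)
        if (errno != EINTR)
            return -1;
    s->pid = 0;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/* Read until EOF, then close the pipe. Returns bytes read, or -1. */
static ssize_t drain_pipe(struct child_slot *s, char *buf, size_t len)
{
    size_t total = 0;

    while (total < len) {
        ssize_t n = read(s->pipe_fd, buf + total, len - total);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0) { /* EOF: the writer is gone and the pipe is empty */
            close(s->pipe_fd);
            s->pipe_fd = -1;
            break;
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

Calling start_child(), then reap_child(), then drain_pipe() should still deliver the whole message, and slot_is_free() should only become true after the drain. The sketch waits for the child before reading purely to make the point; a real parent would read from its event loop instead.
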
> The control flow in tcp_request() is hard to follow, since the same
> loop is used both to read another query and get an answer to it, and to
> fetch a fresh answer for a query which has already been answered from
> stale data and send that back to the mother process.
> 
> The pseudocode which describes what a TCP child process does looks like 
> this.
> 
> do_stale = FALSE
> 
> while (1)
> {
>     have_answer = FALSE
> 
>     if (!do_stale)
>       {
>         read_query()
> 
>         if (no_query_client_connection_closed())
>            exit();
> 
>         if (query_in_cache())
>            have_answer = TRUE
>       }
> 
>     if (!have_answer)
>       {
>          send_query_upstream()
>          get_answer_from_upstream()
>          insert_answer_into_cache_and_pipe_to_parent()
>       }
> 
>     if (do_stale)
>        exit();
> 
>     return_answer()
> 
>     if (answer_was_stale())
>        {
>           do_stale = TRUE;
>           close_client_connection()
>        }
> }
> 
> The client disconnecting doesn't cause the process to exit before the 
> new data has been pushed into the pipe. If you can demonstrate that it 
> does, that's a bug.
> 
> 
> 
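
To make the ordering in the pseudocode easier to check, here is a small simulation of the same loop that records what happens in which order. It is only a sketch under assumptions: the event names, struct trace, tcp_child() and index_of() are all made up for the illustration, the cache lookup is replaced by a pre-set result per query, and nothing here touches a real socket or pipe.

```c
/* Sketch only: a trace of the loop above, not the real tcp_request(). */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum event {
    EV_READ_QUERY, EV_RETURN_ANSWER, EV_CLOSE_CLIENT,
    EV_SEND_UPSTREAM, EV_CACHE_AND_PIPE, EV_EXIT
};

/* Pre-set outcome of the cache lookup for each simulated query. */
enum cache_state { CACHE_MISS, CACHE_FRESH, CACHE_STALE };

#define TRACE_MAX 32
struct trace {
    enum event ev[TRACE_MAX];
    size_t n;
};

/* Append an event; events beyond TRACE_MAX are silently dropped. */
static void record(struct trace *t, enum event e)
{
    if (t->n < TRACE_MAX)
        t->ev[t->n++] = e;
}

/* Position of the first occurrence of e in the trace, or -1. */
static int index_of(const struct trace *t, enum event e)
{
    for (size_t i = 0; i < t->n; i++)
        if (t->ev[i] == e)
            return (int)i;
    return -1;
}

/* One TCP child: the client sends nqueries queries, then disconnects. */
static void tcp_child(const enum cache_state *queries, size_t nqueries,
                      struct trace *t)
{
    bool do_stale = false;
    size_t next = 0;

    t->n = 0;
    for (;;) {
        bool have_answer = false;
        bool stale = false;

        if (!do_stale) {
            if (next == nqueries) {   /* no query: client closed */
                record(t, EV_EXIT);
                return;
            }
            record(t, EV_READ_QUERY);
            enum cache_state c = queries[next++];
            have_answer = (c != CACHE_MISS);
            stale = (c == CACHE_STALE);
        }

        if (!have_answer) {
            record(t, EV_SEND_UPSTREAM);
            record(t, EV_CACHE_AND_PIPE); /* cache locally, pipe to parent */
        }

        if (do_stale) {               /* refresh done, nothing to return */
            record(t, EV_EXIT);
            return;
        }

        record(t, EV_RETURN_ANSWER);

        if (stale) {
            do_stale = true;
            record(t, EV_CLOSE_CLIENT);
        }
    }
}
```

For a single stale query the trace comes out as read, return, close client, send upstream, cache and pipe, exit. In this model the client connection is already closed before the upstream query is even sent, and the exit can only be reached after the pipe write.
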

My sleeping brain had a thought. Because the TCP-handler process doesn't
go through the poll() loop, it has to call check_log_writer() explicitly
to make logging happen - the call is there in tcp_request(). It looks
possible that in the stale-cache path, there is no call to
check_log_writer() between the new data being cached (and logged) and
the process terminating. That would cause a failure to log the replies
even though they do end up in the parent process's cache, and may be
what you are seeing.
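
The failure mode suspected here can be shown with a generic sketch. This is not dnsmasq's logging code: queue_log(), flush_log() and run_child() are hypothetical names, and a pipe stands in for the log destination. A forked child queues a log line in memory and the line only reaches the destination if the child flushes explicitly before it exits.

```c
/* Sketch only: a generic in-memory log queue, not dnsmasq's log.c. */
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define QUEUE_MAX 256

static char queue[QUEUE_MAX]; /* lines waiting to be written */
static size_t queued;         /* bytes currently waiting */

/* Queue a line in memory; nothing is written yet. Overflow is truncated. */
static void queue_log(const char *line)
{
    size_t n = strlen(line);

    if (n > QUEUE_MAX - queued)
        n = QUEUE_MAX - queued;
    memcpy(queue + queued, line, n);
    queued += n;
}

/* Write everything queued so far to fd. Returns 0, or -1 on error. */
static int flush_log(int fd)
{
    size_t off = 0;

    while (off < queued) {
        ssize_t w = write(fd, queue + off, queued - off);
        if (w == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        off += (size_t)w;
    }
    queued = 0;
    return 0;
}

/* Fork a child that queues one line and exits, flushing first only if
   asked to. Returns the number of log bytes that arrived, or -1. */
static ssize_t run_child(const char *line, int flush_before_exit,
                         char *out, size_t len)
{
    int fds[2];
    pid_t pid;
    size_t total = 0;

    if (pipe(fds) == -1)
        return -1;
    pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        queue_log(line);
        if (flush_before_exit)
            flush_log(fds[1]);
        _exit(0); /* anything still queued dies with the process */
    }

    close(fds[1]);
    while (total < len) {
        ssize_t n = read(fds[0], out + total, len - total);
        if (n == -1) {
            if (errno == EINTR)
                continue;
            close(fds[0]);
            return -1;
        }
        if (n == 0)
            break; /* EOF: the child has exited */
        total += (size_t)n;
    }
    close(fds[0]);
    while (waitpid(pid, NULL, 0) == -1 && errno == EINTR)
        ;
    return (ssize_t)total;
}
```

With flush_before_exit set the parent sees the whole line; without it the parent sees nothing, even though the child did all its other work. If the stale-cache path really does skip the final flush, the symptom would be exactly this: the data arrives, the log line doesn't.
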

I shall look at this later, other things to do now.

Cheers,

Simon.