Old SQL Server 2005 merge replication bug, how to determine version on the subscribers and details

I got the following email from devx.com, unfortunately when I replied to this colleague, the email bounced. I’m hoping he can Google this up.

Hi Lizet,

I just read your article on www.devx.com about data loss in Merge Replication. Unfortunately we are experiencing this exact same bug. I have already upgraded the subscriber server to SP3. Do you know if the fix for this issue is included in SQL Server Express SP3 and will it get applied if I just upgrade an existing instance from SP2 to SP3? Or do I need to do a complete uninstall and reinstall? Also, is there anything I can do to verify the subscriber instance had the correct patches applied to it? I appreciate your help on this and the article you wrote. We have a lot of work ahead of us restoring data lost but it could have been worse if not for this article.

I appreciate your time.

My replies:

As far as I can remember any SQL Server Engine (whether Express or Standard or Enterprise) had the problem.
The publisher and the distributor (which can only be a Standard or an Express instance) should be patched as well. Any engine with a version lower than 9.00.3228.00 should be patched, whether applying SP3 or only the Cumulative Update they launched after the replication problem came to the public light.
You can check the version at the subscribers using sqlcmd.
If your subscriber engines are installed as the default engine and use windows authentication, you can connect to them using the following command, you need to be an administrator on the machine in order to apply the CU:
sqlcmd -E
checking the version on the sqlcmd command prompt would be:
> select @@version
>GO
I remember patching the subscribers was a pain, as we didn’t want to push the update automatically and we connected to every single subscriber remotely to make sure the patch (CU6 for SP2 in our case) was applied properly.

I added a post to my blog with the line of code that caused the mess. The link is
Hope this helps, good luck

and

I forgot to answer two of your questions:
Do you know if the fix for this issue is included in SQL Server Express SP3
Yes, the fix is included in SP3. Make sure you patch the publisher and distributor as well, not only the subscribers.
and will it get applied if I just upgrade an existing instance from SP2 to SP3?
Yes.
Or do I need to do a complete uninstall and reinstall?
No, just patch.

Usually you can recover the published database from a previous backup but as you usually don’t keep synchronized backups of each subscriber, you have to recreate the publication and subscriptions, potentially destroying any data at the subscribers that has not been merged yet. Or you could copy the subscriber database to a different location, recreate publication and subscription and manually add the data that didn’t merge from your saved database to the newly merged database.

Hopefully this reaches you, again good luck recovering the data.

If you delete a data partition on a merge repl, make sure you do this additional clean up

We wanted to delete old data partition on our merge publication. The Management Studio has a nice UI to view the data partitions and generate the dynamic snapshot per partition, also to delete the partitions you no longer need.

Be warned that this nice UI delete button won’t delete the partition folder at the distributor nor will delete the dynamic snapshot job at the distributor. If you try to add a partition with the same filter token, it will fail.

In order to have a fresh start for that partition value you should:

1. Verify that the subscriptions that use that data partition don’t exist anymore.
2. Delete the data partition using the Publication Properties UI
3. Delete the data partition using the sp_dropmergepartition stored procedure:

use [Published_Database]
exec sp_dropmergepartition @publication = N’Publication_Name’, @suser_sname = N”, @host_name = N’the string we use to filter the data part’
GO

4. Delete the replication partition folder manually at \\distributor\repldata

Happy data merging!

PS SQL Server version 9.00.3228

More on merge replication with subscription expiring after a period of time

We have a merge replication topology with pull subscriptions. We have a setting to expire the subscriptions that haven’t synchronized in the past X days. This setting was mainly due to optimizations, when you don’t expire your subscriptions the amount of metadata to be used grows and grows and your merge process suffers.

The drawback on this setting is that it also makes the snapshot obsolete after X days for any new or reinitialized subscription.

When you add a new subscription to your publication and the snapshot was generated X-1 days ago you will have the following error:

The snapshot for this publication has become obsolete. The snapshot agent needs to be run again before the subscription can be synchronized.

At first we wondered why we got this error when the rest of the subscriptions were working just fine and also wondered what impact would have on the existing subscriptions to regenerate the snapshot.

The answer is: no impact.

We got the explanation from one of Hilary Cotter‘s replies:

The snapshot contains all the data, schema and metadata required to create
all the objects on the subscriber. After one day in your case the metadata
on the publisher is purged and what is in the snapshot will not be
sufficient to sync with the publisher, hence you need a new snapshot.

You want to set a small number so that not a lot of data goes across the
wire, but a big enough number so that the majority of your subscribers will
sync within the expiration time. If you set it to an infinite amount – never
expires, a lot of data will have to go to the subscriber to get it back in
sync.

And another reply with further clarifications:

The answer lies in the MSmerge_contents and MSmerge_genhistory tables.
These two tables hold the list of data changes that happened for the
past x days, x being the subscription expiration days. After x days the
record of the data change expire from the MSmerge_contents table. The
implication of that is that existing subscriptions that have not
synchronised for the past x days will then not be able to merge that
change anymore. The same holds true for creating new subscriptions with
an old snapshot – remember the snapshot also contains data. If the
snapshot was created x-2 days ago you will missing two days of data
changes that has already expired from the MSmerge_contents table.

and

a thread on MSDN:
http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=945801&SiteID=1

This merge replication error: The merge process could not connect to the Publisher ‘Server:database’ is so misleading

We have a merge topology in place with pull subscriptions, this is the merge agent runs at the subscribers.
One of our subscriptions in had the error that the merge process couldn’t find the server. The server was there and the ping was fine, also the Replication Monitor was able to register the error with the x mark.

The details of the ertor are as follows:

Command attempted:

{call sp_MSensure_single_instance (N’Merge Agent Name’, 4)}

Error messages:

The merge process could not connect to the Publisher ‘Server:database’. Check to ensure that the server is running. (Source: MSSQL_REPL, Error number: MSSQL_REPL-2147199368)
Get help: http://help/MSSQL_REPL-2147199368

Another merge agent for the subscription(s) is running or the server is working on a previous request by the same agent. (Source: MSSQLServer, Error number: 21036)
Get help: http://help/21036

Our error was due to the second cause. It seems the subscriber had lost power while replicating, and any replication after that could not acquire a lock for the merge agent. We restarted the subscriber machine, reinitialized the subscription, with no luck. Only when we dropped the subscription and recreated it again the subscriber was able to run the replication agent again. Just a curious note for the future as this error is not well documented. The only MSDN forum thread that deals with is is still unanswered here…

Lets rectify this error (was Compressed snapshot check makes it fail on one subscriber)


I published a post with the wrong title. It was Compressed snapshot check makes it fail on one subscriber

However the file dynsnapvalidation.tok appears inside the partitioned snapshot whether you select snapshot compression or not. This .tok file shows up if you create data partitions per parameterized filter.

Compressed snapshot check makes it fail on one subscriber

As this blog has my personal bread crumbs I thought I would better record this.

We have problems with our merge replication topology (SQL Server 2005 9.0.3228 Publisher and Dist and SQLE 9.0.3228 as Subscribers, pull subscriptions) the other day as we switched to compressed snapshots. I posted the problem in the forum and searched the books about it without luck…

Here’s the problem

We noticed one subscriber downloaded the compressed dynamic snapshot, but was unable to apply it:

The trace at the subscriber showed the following error:

Partitioned snapshot validation failed for this Subscriber. The snapshot validation token stored in the specified partitioned snapshot location does not match the value ”b422311” used by the Merge Agent when evaluating the parameterized filter function. If specifying the location of the partitioned snapshot (using –DynamicSnapshotLocation), you must ensure that the snapshot files in that directory belong to the correct partition or allow the Merge Agent to automatically detect the location.’,1,1,N’
SQL Merge Agent encountered an error.

The merge agent log with OutputVerbose level 3 showed the following:

2008-06-03 18:07:48.222 [0%] [978 sec remaining] Snapshot will be applied from a compressed cabinet file
2008-06-03 18:07:48.238 OLE DB Distributor ‘TORDISTRIBUTOR‘: {call sys.sp_MSadd_merge_history90 (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)}
2008-06-03 18:07:48.254 OLE DB Subscriber ‘TORSUBSCRIBER‘: {call sys.sp_MSadd_merge_history90 (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)}
2008-06-03 18:07:48.285 [0%] [978 sec remaining] A dynamic snapshot will be applied from ‘\\TORDISTRIBUTOR\repldata\unc\TORPUBLISHER_PUBLISHER_ PUBLICATION\B422311_1\’
2008-06-03 18:07:48.285 OLE DB Distributor ‘TORDISTRIBUTOR‘: {call sys.sp_MSadd_merge_history90 (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)}
2008-06-03 18:07:48.300 OLE DB Subscriber ‘TORSUBSCRIBER‘: {call sys.sp_MSadd_merge_history90 (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)}
2008-06-03 18:07:48.316 [0%] [732 sec remaining] Validating dynamic snapshot
2008-06-03 18:07:48.316 OLE DB Distributor ‘TORDISTRIBUTOR‘: {call sys.sp_MSadd_merge_history90 (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)}
2008-06-03 18:07:48.332 OLE DB Subscriber ‘TORSUBSCRIBER‘: {call sys.sp_MSadd_merge_history90 (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)}
2008-06-03 18:07:48.347 [0%] [732 sec remaining] Extracting snapshot file ‘dynsnapvalidation.tok‘ from cabinet file
2008-06-03 18:07:48.363 OLE DB Distributor ‘TORDISTRIBUTOR‘: {call sys.sp_MSadd_merge_history90 (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)}
2008-06-03 18:07:48.534 OLE DB Subscriber ‘TORSUBSCRIBER‘: {call sys.sp_MSadd_merge_history90 (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)}
2008-06-03 18:07:48.566 [0%] [732 sec remaining] Extracted file ‘dynsnapvalidation.tok
2008-06-03 18:07:48.566 OLE DB Distributor ‘TORDISTRIBUTOR‘: {call sys.sp_MSadd_merge_history90 (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)}
2008-06-03 18:07:48.566 OLE DB Subscriber ‘TORSUBSCRIBER‘: sys.sp_MSregisterdynsnapseqno @snapshot_session_token = N’\\TORDISTRIBUTOR\ReplData\unc\TORPUBLISHER_PUBLISHER_ PUBLICATION\20080527113713\dynsnap‘, @dynsnapseqno = ‘E69C5BB3-1095-429C-92BC-46747E49A155’
2008-06-03 18:07:48.675 OLE DB Subscriber ‘TORSUBSCRIBER‘: sp_MSreleasesnapshotdeliverysessionlock
2008-06-03 18:07:48.753 The merge process was unable to deliver the snapshot to the Subscriber. If using Web synchronization, the merge process may have been unable to create or write to the message file. When troubleshooting, restart the synchronization with verbose history logging and specify an output file to which to write.
2008-06-03 18:07:48.753 OLE DB Subscriber ‘TORSUBSCRIBER‘: {call sys.sp_MSadd_merge_history90 (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)}
2008-06-03 18:07:48.862 The merge process was unable to deliver the snapshot to the Subscriber. If using Web synchronization, the merge process may have been unable to create or write to the message file. When troubleshooting, restart the synchronization with verbose history logging and specify an output file to which to write.

It seems that when we use compressed snapshots and we validate the subscriptions using the HOST_NAME() there is an extra check done by the sync agent. The sync agent checks a token file inside the snapshot.cab file for that user. The filename is dynsnapvalidation.tok and contains the HOST_NAME() in our case.

The HOST_NAME() the subscriber passed to the agent was lower cased, and the partition, token file and sync agent job and token file had the HOST_NAME() upper cased.

Partitioned snapshot validation failed for this Subscriber. The snapshot validation token stored in the specified partitioned snapshot location does not match the value ”b422311” used by the Merge Agent when evaluating the parameterized filter function. If specifying the location of the partitioned snapshot (using –DynamicSnapshotLocation), you must ensure that the snapshot files in that directory belong to the correct partition or allow the Merge Agent to automatically detect the location.’,1,1,N’

SQL Merge Agent encountered an error.

Here’s the solution:

The way we solved the issue by changing the reference data to be lower cased, deleted the snapshot folder, deleted the sync agent job at the distributor for that partition. Dropped the subscription and created it again and it recreated the sync agent job at the distributor with the proper lower cased Host_Name().

We haven’t been able to reproduce it again, but time permitting, we will :-p

I tried to reproduce the scenario using a publisher and subscriber on the same server (SQL Server 2005 Developer build 9.00.3228) but couldn’t reproduce the error. On our real life scenario we use SQL Server Express as subscriber and RMO to trigger replication.

The fix for batch deletions at Publisher when use_partition_groups is set to false In Merge Replication is indeed there!!!

Yay!

The fix for this bug is here. Microsoft published Cumulative Update 6 for SQL Server 2005 SP3. Please see previous post for the original bug description. We had problems when not using partition groups on our merge replication topology (see the TSQL Scripts below). When there were batch deletions caused by filtering at the Publisher, those rows were deleted at the publisher (published database) during the next replication (synchronization with subscribers)

The faulty stored procedure was sp_MSdelsubrowsbatch

The faulty line of code, believe it or not, was a single line
on the faulty SPROC and can be seen below in bold:

— change tombstone type for the non-original rows (the rows that got deleted via expansion).
update dbo.MSmerge_tombstone
set type = 1
from
(select distinct tablenick, rowguid from
#notbelong where original_row <> 1) nb,
dbo.MSmerge_tombstone ts
where ts.tablenick = nb.tablenick and
ts.rowguid = nb.rowguid
option (FORCE ORDER, LOOP JOIN)

On the sproc after CU6:
update dbo.MSmerge_tombstone
set type = @METADATA_TYPE_PartialDelete
from
(select distinct tablenick, rowguid from
#notbelong where original_row <> 1) nb,
dbo.MSmerge_tombstone ts
where ts.tablenick = nb.tablenick and
ts.rowguid = nb.rowguid
option (FORCE ORDER, LOOP JOIN)


Basically the deletions caused by filtering (Partial deletes) were marked as tombstone or user deletions.

Given the following variable declarations:

declare @METADATA_TYPE_Tombstone tinyint
declare @METADATA_TYPE_PartialDelete tinyint
declare @METADATA_TYPE_SystemDelete tinyint

and assignation:

set @METADATA_TYPE_Tombstone= 1
set @METADATA_TYPE_PartialDelete= 5
set @METADATA_TYPE_SystemDelete= 6

 

This caused deletions in the published database. Each subscriber had a subset of the published database. The merge replication was created with filters to allow having at one subscriber only the data pertaining to that subscriber and not the data pertaining to any other subscriber.

If you ever doubted the power of a single line of code… :-p

Happy coding guys!

 

Cumulative Update Number 6 for SQL Server 2005 SP2 is out!!!

Hi DBAs and Database Developers out there,
Finally the long awaited CU#6 was released on Feb 18th near mid night.

946608 Cumulative update package 6 for SQL Server 2005 Service Pack 2
http://support.microsoft.com/default.aspx?scid=kb;EN-US;946608

I’m currently testing our merge replication issues with this CU.

Even though the bug described in the DevX article is not in the KB article list, we got confirmation that the fix is on the CU#6.

So far the identity automatic management problem described on my previous post remains the same. If the user making the inserts is not a db_owner, automatic identity management ain’t going to happen on your publisher…

Identity Range not working for Merge Replication in SQL Server 2005

Back in Sept 2007 I blogged about the problem we were having with the identity range in merge replication:@@identity not working after SQL Server 2005 upgrade The problem continued until today, this post is to explain what we figured out.

The error message that describes this problem reads as follows:

[548] The insert failed. It conflicted with an identity range check constraint in database ‘DatabaseName’, replicated table ‘dbo.TableName’, column ‘ColumnNameId’. If the identity column is automatically managed by replication, update the range as follows: for the Publisher, execute sp_adjustpublisheridentityrange; for the Subscriber, run the Distribution Agent or the Merge Agent.

The identity range adjustment happens after every insert in the given article. The code responsible for the identity check adjustment is on the system trigger for the published article:
MSmerge_isn_GUID where GUID is the GUID for the given article.

if is_member(‘db_owner’) = 1
begin
— select the range values from the MSmerge_identity_range table
— this can be hardcoded if performance is a problem
declare @range_begin numeric(38,0)
declare @range_end numeric(38,0)
declare @next_range_begin numeric(38,0)
declare @next_range_end numeric(38,0)

select @range_begin = range_begin,
@range_end = range_end,
@next_range_begin = next_range_begin,
@next_range_end = next_range_end
from dbo.MSmerge_identity_range where artid=’BAEF9398-B1B1-4A68-90A4-602E3383F74A’ and subid=’0F9826DB-50FB-4F73-844D-AE3A111B4E1C’ and is_pub_range=0

if @range_begin is not null and @range_end is not NULL and @next_range_begin is not null and @next_range_end is not NULL
begin
if IDENT_CURRENT(‘[dbo].[TableName]’) = @range_end
begin
DBCC CHECKIDENT (‘[dbo].[TableName]’, RESEED, @next_range_begin) with no_infomsgs
end
else if IDENT_CURRENT(‘[dbo].[TableName]’) >= @next_range_end
begin
exec sys.sp_MSrefresh_publisher_idrange ‘[dbo].[TableName]’, ‘0F9826DB-50FB-4F73-844D-AE3A111B4E1C’, ‘BAEF9398-B1B1-4A68-90A4-602E3383F74A’, 2, 1
if @@error<>0 or @retcode<>0
goto FAILURE
end
end
end

As you might have noticed already, if the insertion is made by a user that is not member of the db_owner database role, the identity adjustment won’t happen. I believe this is a bug, not a feature. It forces the users that are allowed to do insertions to be owners of the database, or they will have the run out of identities error quite often and a manual adjustment will be required.

What the savvy DBA can do:

The template for all of the MSmerge_ins_GUID triggers are on the model database, these triggers are created by the system stored procedure: sp_MSaddmergetriggers_internal

If you change the code on this system stored procedure any new database creation will have the proper template for creating the merge insertion triggers. For the existing databases, you might like to modify the MSmerge_ins_GUID trigger manually, for now.

At the time I’m writing this post I’m not aware if this issue is addressed on any cumulative update. There is a similar case on the Cumulative Update released on January 15, 2008
http://support.microsoft.com/kb/941989

Happy coding guys!

Merge Replication parameterized row filtering, beware of the optimizations

I haven’t blogged in a while, mostly due to my workload that has sky rocketed.
My most recent challenge is with SQL Server 2005 Merge Replication.

We have had quite a few undesirable effects when we upgraded our subscribers from ole good MSDE to SQL Server Express.
The side effects led to data loss and emergency fixes, in a really chaotic way. This data loss didn’t happen with MSDE.

The article I recently published on DevX described a data loss scenario with SQL Server 2005 Ent as publisher/distributor and MSDE or SQLE as subscriber.

The scenario we got recently happens only with SQLE in the subscribers and is related to the parameterized filter optimizations in SQL Server 2005.

Tweaking the parameters @use_partition_groups and @keep_partition_changes produce different side effects that go from data loss to inability to merge the data, the parameterized filters stop working.

If you’re interested on the subject hang on, I will publish repros in the next post.

If anyone knows how to set up the Windows Synchronization Manager for Verbose History Logging to output to a file, please comment here, all my attempts have failed.

Cheers and happy holidays!